Sunday, August 30, 2009

Semester started...

FA09 has officially started and a week has been gone by already... I was back in town from Las Vegas on Wednesday Morning so maybe that is perhaps why the week seemed so short.
Had a blast in LV + Grand Canyon, definitely places that one should see at least once... What I saw of Grand Canyon must be a tiny tiny fraction of the whole thing, so I'm up for few more trips to Grand Canyon..

I felt I needed to write out my random thoughts and plans etc and have a clear list of things for my own sake.. my head seems to be in the holiday mode still, and I already have quite a lot of work on my plate.
- I'll first need to decide which course to drop, since TA+2 courses is kind of a standard for me and another course will be the extra feather that broke an elephant's back..
- Then there is the looming half marathon which I don't think I'm ready for, physically more than mentally :p don't know what to do about that at the mo, since I really want to participate but having the semester started and all, I'm finding harder to work out... wow a classic excuse I must say - during the break I was too lazy to work out (oh well that plus 8 hours a day working schedule didn't really help...) and during the semester I'm too busy to work out... dang it... I'll see what happens I guess...
- The weather can no longer be categorized as summer anymore... it is around 23C and for me I had spent 5 days in 40C heat (and before I left it was around 30C in town) that 23C is now cool and sometimes cold... Looming winter is just too scary for me and I bought a 600-fill down snowboard jacket in preparation for it.

Thursday, August 20, 2009

The time I saw 290 million tuples in a relation

I thought 50thousand rows in a relation was huge when I was working back home. This was a real-production site-live data. And then I saw a relation with 290 MILLION rows. WITH NO INDEX on the search column other than the pk column. OMG. Searching a simple where none-pk=yyy literally took 30 minutes, on good days. And the thing I wanted to do with this humongous relation was to choose deterministically randomized subsets (that is, same order of randomized rows every time).
The following is the progression of the query:

1. select .... from XXX where .... order by rand(seed) limit x, y;
This took waaaaaaaay too long for my purposes, which is no wonder because the whole order by rand() on the entire tuple is too expensive when anything for 290 million tuples is just too slow already.

2. select ... from XXX where ... and id in (select id from XXX order by rand()) limit x, y;
Now, this looked like a good solution for about half an hour. The speed of query was definitely faster by order of 100, but the problem was that limit x,y could not be inside the subquery (mysql limitation, nothing syntactical)... meaning the result of subquery was shuffled and randomly ordered, but it included ALL ids. So, going id in (
shuffled and randomly ordered ALL id) limit x,y ended up returning y number of tuples in sequential id order, because id being the primary key, it had index built on it.

3. select ... from XXX where ... and id in (select (select id from XXX order by rand() limit x, y) as shuffledID);
This looked like a good solution, again for about half an hour. While this was in fact shuffling and limiting the result of subquery, and therefore what gets selected for the outer query was non-sequential y number of tuples, IT WAS SLOW.. similar performance as the initial try, which is no wonder as nested anything is usually a headache if you do it more than once :p

4.
select id,.... from XXX where .... order by password(id) limit x, y;
This was me thinking outside the box. Password function in mysql uses hashing to generate "random" string (hexadecimal if I am not wrong) and that made me think, there are more ways to randomize! Initially I thought the performance wasn't going to cut it but out of 4 options, it was actually the best in terms of query time and randomizedness and I went with this one.

This, with memcache, index on search columns and not having that 290million tuple relation (splitted it up in the end.. denormalization for the sake of performance is not really recommended, but it was done in a way that does not require joins for queries, so I am happy with it.. just couldn't do a darn thing with that huge a table) made things better... ahhh fun small challenge this was :)

Sunday, August 16, 2009

darn.. I almost forgot I had a blog

I guess I sorta knew I'll eventually fade off from blogging.. and here I am one-per-month blogging.. Anyways, summer break is nearly over, with only a week to go before I go to Vegas and then come back to already-started Fall09. It was a good break, didn't think 3 month will ever be over but here I am :p.

So. Worked for 2 months, writing python web crawlers, working on Linux more than I ever have before, doing db back-end stuff etc.. So I learned a new language at least :)
Hanging out with friends without constant worry/stress about having to finish an MP or reading list or whatever the heck I did during the semester was so very very nice too. Some down points were the fact that sometimes the whole "outing" made me miss my old friends even more, with whom I can go to a pub and drink and goof around/talk about real stuff, whereas some people I got to know here mainly hang out to drink... Another thing is that I am sorta at the older end of age spectrum... I used to think that wouldn't be a problem, but with Korean friends, it is kinda different :S hard to explain, it's a cultural thing.

I'm kinda pumped to go back to what I only can describe as study-war. This year will be more hectic since I have decided on my thesis topic and will have to produce a decent thesis draft at least. Good (bad?) news is that I HAVE to do one more semester here, because I wasted couple of course slots trying out data mining... oh well, I'm open to getting the most of what I can from UIUC. Good news is that I secured TA position for the year (year? semester?), meaning my tuition is wavered... and this time it is not introductory java course, where I had to grade 200+ exam papers overnight, teaching classes for 4 hours a week and answer emails etc. I will be TAing for a database course that I took last semester. Sweet..
So.. new semester? bring it on!