I thought 50 thousand rows in a relation was huge when I was working back home. That was a real production site with live data. And then I saw a relation with 290 MILLION rows, WITH NO INDEX on the search column other than the PK. OMG. A simple "where non_pk = yyy" literally took 30 minutes, on good days. And the thing I wanted to do with this humongous relation was to pick deterministically randomized subsets (that is, the same order of randomized rows every time).
The following is the progression of the query:
1. select .... from XXX where .... order by rand(seed) limit x, y;
This took waaaaaaaay too long for my purposes, which is no wonder: order by rand() has to generate a random value for every matching row and sort the entire result set before the limit is applied, and anything that touches 290 million tuples is already too slow.
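For reference, a minimal sketch of this first attempt; the table and column names (users, status) are made up for illustration:

```sql
-- Hypothetical schema: users(id PK, name, status), ~290M rows.
-- rand(42) is seeded, so the shuffle order is the same on every run,
-- but MySQL still evaluates rand() once per matching row and then
-- filesorts the whole result before LIMIT can discard anything.
select id, name
from users
where status = 'active'
order by rand(42)
limit 1000, 50;
```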
2. select ... from XXX where ... and id in (select id from XXX order by rand()) limit x, y;
Now, this looked like a good solution for about half an hour. The query was faster by about two orders of magnitude, but the problem was that limit x, y could not go inside the subquery (a MySQL limitation, nothing syntactical)... meaning the result of the subquery was shuffled and randomly ordered, but it included ALL ids. So "id in (shuffled and randomly ordered ALL ids) limit x, y" ended up returning y tuples in sequential id order, because id, being the primary key, had an index built on it.
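In other words, the subquery's ordering is simply thrown away. A sketch of what effectively happens (same hypothetical names as above):

```sql
-- IN () only tests membership, so the ORDER BY inside the subquery is
-- meaningless: this is equivalent to "id in (select id from users)".
-- The outer LIMIT then takes the first y rows in whatever order the
-- outer query produces, typically primary-key order via the PK index.
select id, name
from users
where status = 'active'
  and id in (select id from users order by rand(42))  -- order discarded
limit 1000, 50;
```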
3. select ... from XXX where ... and id in (select (select id from XXX order by rand() limit x, y) as shuffledID);
This looked like a good solution, again for about half an hour. While this did in fact shuffle and limit the result of the subquery, so what got selected in the outer query was y non-sequential tuples, IT WAS SLOW... similar performance to the initial try, which is no wonder, as nesting anything more than once is usually a headache :p
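For what it's worth, MySQL does accept LIMIT inside a derived table (just not directly inside an IN subquery), so joining against the limited subquery is one way to keep both the shuffle and the speed. A hedged sketch, again with made-up names; I haven't benchmarked this against a 290M-row table:

```sql
-- Materialize the shuffled, limited id list once as a derived table,
-- then join back to fetch the full rows. The inner scan still shuffles
-- all ids, but the outer step is just y indexed primary-key lookups.
select u.id, u.name
from users u
join (
    select id
    from users
    order by rand(42)
    limit 1000, 50
) as shuffled using (id);
```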
4. select id,.... from XXX where .... order by password(id) limit x, y;
This was me thinking outside the box. The PASSWORD() function in MySQL uses hashing to generate a "random"-looking string (hexadecimal, if I am not wrong), and that made me think: there are more ways to randomize! Initially I thought the performance wasn't going to cut it, but out of the 4 options it was actually the best in terms of query time and randomness, so I went with this one.
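A sketch of this final approach. Note that PASSWORD() was deprecated in MySQL 5.7 and removed in 8.0, so on a modern server MD5() or SHA1() would play the same role; hypothetical names as before:

```sql
-- md5(id) is a pure function of id, so the "shuffled" order is the
-- same on every run: a deterministic pseudo-shuffle with no per-row
-- rand() state. Appending a salt string changes the permutation while
-- keeping it deterministic.
select id, name
from users
where status = 'active'
order by md5(concat(id, 'salt-1'))
limit 1000, 50;
```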
This, together with memcache, indexes on the search columns, and getting rid of that 290-million-tuple relation (split it up in the end... denormalization for the sake of performance is not really recommended, but it was done in a way that does not require joins for queries, so I am happy with it... just couldn't do a darn thing with that huge a table) made things better. Ahhh, fun small challenge this was :)