Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?

soon_j has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by gmax (Abbot) on Jan 03, 2004 at 19:54 UTC
There are a few ways of achieving what you want. The one you are using (`SELECT ... ORDER BY RAND() LIMIT 1`) is easy and immediate. As you said, efficiency could be very poor when your table size increases. However, you can improve the execution speed if you only select the primary key column in your query. Note that this code can be used also if you want to get more than one random question at once. `my $limit = 1; my $query = qq{SELECT id FROM table ORDER BY RAND() LIMIT $limit}; my $ids = $dbh->selectcol_arrayref($query) or die "execution error ($DBI::errstr\n"; $query = qq{SELECT question, answer FROM table WHERE id in (} . (join ",", @$ids) . " ) " ;` [download] Selecting only the primary key will reduce the amount of data that the DBMS has to process. Therefore, selecting and sorting will be faster. A very fast alternative is available when your data doesn't change so often. If your data changes, say, once a day but you are querying around the clock, then you can make an auxiliary table each time you update or insert something. `mysql> create table if not exists support_tbl -> (cnt int not null auto_increment primary key, -> id int not null); mysql> truncate support_tbl; mysql> insert into support_tbl (id) select id from table order by rand +();` [download] Here you have a supporting table with primary keys in random order. Yopu can get a pageload of IDs very easily: `my $cnt =1; my $ids = $dbh->selectcol_arrayref(qq{select id from support_tbl where cnt = ?}, undef, $cnt);` [download] For your next request, you will increase $cnt according to your needs. Another possibility is calculating the random value in Perl. `# make sure our random value is not higher than # any value in the primary key my ($max) = @{$dbh->selectcol_arrayref(qq{select max(id) from table})} +; my $cnt = rand($max); my $query = qq{SELECT question, answer FROM table WHERE id = $cnt };` [download] WARNING! Although this solution is fast, there is a catch. It will only work if your data does not have any gaps in your primary key column, i.e. if you deleted any row this method may not work as expected. HTH _ _ _ _ (_\|\| \| \|(_\|>< _\|	[reply] [d/l] [select]
Re: Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by sgifford (Prior) on Jan 04, 2004 at 08:17 UTC
I used the third technique in a recent project, but instead of using the primary key (which was a non-sequential ID number), I added a `rownum` column which gave each row a sequential number. I then selected a random value between 1 and the number of rows. The reason this is much faster than the `ORDER BY rand()` technique is that (AFAIK) in order to resolve `ORDER BY rand()`, MySQL has to go through each row, generate a random number, then see which one is the lowest. If the table is very large, that takes a long time. The problem with using a row number is that if the data changes, you can end up gaps with in numbering. If the data is generated periodically, there's no problem; just regenerate the row numbers with the rest of the data. Periodically, you can regenerate the row numbers to fill in gaps. If there's just a few missing numbers scattered here and there and you don't need great randomness, it's no big deal; instead of using `WHERE id = $rand` you'd use `WHERE id >= $rand LIMIT 1`. If you have a lot of changes, though, this technique will work poorly. Something that might work a little better with a large number of changes would be to randomly assign each record a number between, say, 1 and 100. To choose a random record, then, you would first pick a number between 1 and 100, then you would get a record randomly from the set of rows assigned that number with `ORDER BY rand()`. It would still have to evaluate multiple records, but on average it would only be 1/100 as many records as using `ORDER BY rand()` on the whole table.	[reply]
Re: Re: Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by soon_j (Scribe) on Jan 04, 2004 at 08:56 UTC
That's a good suggestion! I believe you are right, it's a lot faster. Now I am left with one problem. In order to implement your suggestion, I only need to issue a single query from my questions table in the database. PLEASE SEE MY RESPONSE TO GMAX FOR AN IDEA. Thanks a lot! Jay	[reply]
Re: Re: Re: Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by sgifford (Prior) on Jan 04, 2004 at 09:56 UTC
Re: Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by soon_j (Scribe) on Jan 04, 2004 at 08:41 UTC
Thanks for the suggestions! But, here's the complete scenario. The questions table have columns: ques_id (PRIMARY KEY AUTO_INCREMENT), subject (VARCHAR), category_id (INT UNSIGNED), option_a (VARCHAR), ... option_d (VARCHAR). Another table (categories), have the following columns: category_id (PRIMARY KEY AUTO_INCREMENT), subject (VARCHAR), category (VARCHAR). First, I have to retrieve all DISTINCT subjects from the categories table. (Example, I retrieve the following distinct subjects: Biology, Astronomy,... Mathematics). Since the user configures which subjects he wants to include on his exam, then I really have to do that. So this what I did (SELECT DISTINCT subject FROM categories). Next, I have to retrieve also the categories under the selected subject. (Example, I have to retrieve the ff: categories under the subject Mathematics: Algebra, Geometry,... Calculus). So I have to do this (SELECT ques_id FROM questions WHERE subject='Subject1' and Category='Category1' ORDER BY RAND()). This routine would be repeated for each subject. After each query described above, but before starting a new query for the next subject, I pushed the random results to an array. On the next subject, the results would be added or pushed to the array previously described. This would result into an array having random ques_id's. However, though random the results are from each query, they are clustered according to subjects. Like, for instance, the first ten elements may be about Biology, the next 10 might be of Astronomy, and so forth. This is the reason why I have to randomize again the elements in the resulting array to make sure they are truly random. Since the number of subjects could vary... like if a subject Religion was added to the existing subjects (Biology, Astronomy,...), I did not issue a single query like (SELECT ques_id FROM questions WHERE (subject='Subject1' and category='Category1') OR (subject='Subject2' and category='Category2')). If only I could issue a single effective query (which I haven't had a good idea yet), your suggestion of (SELECT ques_id FROM ... ORDER BY RAND() LIMIT 1) could be extremely useful. But with this type of query, would there be a chance that the same question would be repeated? Looks like... or is MySQL smart enough to know and avoid repeated results from RAND() LIMIT 1? Your next suggestion of creating a table and dumping the randomized results into it is a good alternative that I have to try. To wrap up, is there a good way to issue a single query so that I don't have to randomize the elements of my resulting array? For example, if I am the user, and I selected the subjects: Mathematics (with the following categories: Algebra, Geometry) and Astronomy (with the categories: Solary System, Astronomers), what would be a good way of issue a query? Without doing my lousy way of doing a query for each subject? Thanks! JAY	[reply]
Re: Re: Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by sgifford (Prior) on Jan 04, 2004 at 09:53 UTC
MySQL may repeat results if you do repeated queries with `ORDER BY rand LIMIT 1`. But it won't if you ask for all 10 questions at once---`ORDER BY rand LIMIT 10`. You may have to issue one query per subject, but so what? Unless there are dozens of subjects, the speed difference won't be noticeable.	[reply]
Re: Re: Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by Anonymous Monk on Jan 04, 2004 at 09:00 UTC
`SELECT ques_id FROM questions q, categories c WHERE q.category_id = c.category_id AND category = 'Mathematics' AND subject in ('Calculus', 'Geometry', 'Algebra') ORDER BY RAND() LIMIT 1` [download]	[reply] [d/l]
Re: Re: Re: Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by soon_j (Scribe) on Jan 04, 2004 at 09:10 UTC
Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by b10m (Vicar) on Jan 03, 2004 at 14:11 UTC
I'm not sure wheter this is a "better" way than the MySQL order by RAND() function, but of course you could have Perl create a "random" number (see rand) and use something like: `SELECT * FROM table WHERE id=$rand` [download] I assume that you have a column called 'id' which contains a unique number for and $rand contains the random number generated. You might want to benchmark both methods (and possibly more methods to come) and see what works best. HTH -- b10m	[reply] [d/l]
Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)? by epoptai (Curate) on Jan 03, 2004 at 15:55 UTC
You can select any number of random rows by limiting the results of your query: `SELECT question, answer FROM table ORDER BY RAND() LIMIT 1` [download] It'll be much faster and more efficient than pushing all the results into an array and dumping them into a temporary table. That seems like too much work for the task you described, unless there's a good reason for the temp table. -- perl -MO=Deparse -e"u j t S n a t o e h r , e p l r a h k c r e"	[reply] [d/l]