Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?

by soon_j (Scribe)
on Jan 03, 2004 at 13:43 UTC ( [id://318504]=perlquestion: print w/replies, xml ) Need Help??

soon_j has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I'm Jay. I have a MySQL table with a column for a question and another for the answer. The table contains several hundred rows.

What I need is to retrieve a row randomly from this table so that a question will be randomly picked and posted on a page (for an online quiz application). MySQL only offers a way of sorting the results of a query randomly. I'm avoiding this because as the table increases in size, I'll be selecting all rows then randomizing (SELECT question, answer FROM table ORDER BY RAND()).

At the moment, since I have not thought of a good algorithm, I did as what I said from above. I selected all rows then sorting the results randomly through MySQL. Each result is pushed to an array. Afterwhich I again randomized the elements in the array (because of some reasons...) then dumped them to a temporary table. From the temporary table, I then pulled my questions.

Is there a better way?

Jay
  • Comment on Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?

Replies are listed 'Best First'.
Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?
by gmax (Abbot) on Jan 03, 2004 at 19:54 UTC

    There are a few ways of achieving what you want.

    • The one you are using (SELECT ... ORDER BY RAND() LIMIT 1) is easy and immediate. As you said, efficiency could be very poor when your table size increases.

      However, you can improve the execution speed if you only select the primary key column in your query.

      Note that this code can be used also if you want to get more than one random question at once.

      my $limit = 1; my $query = qq{SELECT id FROM table ORDER BY RAND() LIMIT $limit}; my $ids = $dbh->selectcol_arrayref($query) or die "execution error ($DBI::errstr\n"; $query = qq{SELECT question, answer FROM table WHERE id in (} . (join ",", @$ids) . " ) " ;

      Selecting only the primary key will reduce the amount of data that the DBMS has to process. Therefore, selecting and sorting will be faster.

    • A very fast alternative is available when your data doesn't change so often. If your data changes, say, once a day but you are querying around the clock, then you can make an auxiliary table each time you update or insert something.

      mysql> create table if not exists support_tbl -> (cnt int not null auto_increment primary key, -> id int not null); mysql> truncate support_tbl; mysql> insert into support_tbl (id) select id from table order by rand +();

      Here you have a supporting table with primary keys in random order. Yopu can get a pageload of IDs very easily:

      my $cnt =1; my $ids = $dbh->selectcol_arrayref(qq{select id from support_tbl where cnt = ?}, undef, $cnt);

      For your next request, you will increase $cnt according to your needs.

    • Another possibility is calculating the random value in Perl.

      # make sure our random value is not higher than # any value in the primary key my ($max) = @{$dbh->selectcol_arrayref(qq{select max(id) from table})} +; my $cnt = rand($max); my $query = qq{SELECT question, answer FROM table WHERE id = $cnt };

      WARNING! Although this solution is fast, there is a catch. It will only work if your data does not have any gaps in your primary key column, i.e. if you deleted any row this method may not work as expected.

    HTH

     _  _ _  _  
    (_|| | |(_|><
     _|   
    

      I used the third technique in a recent project, but instead of using the primary key (which was a non-sequential ID number), I added a rownum column which gave each row a sequential number. I then selected a random value between 1 and the number of rows.

      The reason this is much faster than the ORDER BY rand() technique is that (AFAIK) in order to resolve ORDER BY rand(), MySQL has to go through each row, generate a random number, then see which one is the lowest. If the table is very large, that takes a long time.

      The problem with using a row number is that if the data changes, you can end up gaps with in numbering. If the data is generated periodically, there's no problem; just regenerate the row numbers with the rest of the data. Periodically, you can regenerate the row numbers to fill in gaps. If there's just a few missing numbers scattered here and there and you don't need great randomness, it's no big deal; instead of using WHERE id = $rand you'd use WHERE id >= $rand LIMIT 1. If you have a lot of changes, though, this technique will work poorly.

      Something that might work a little better with a large number of changes would be to randomly assign each record a number between, say, 1 and 100. To choose a random record, then, you would first pick a number between 1 and 100, then you would get a record randomly from the set of rows assigned that number with ORDER BY rand(). It would still have to evaluate multiple records, but on average it would only be 1/100 as many records as using ORDER BY rand() on the whole table.

        That's a good suggestion! I believe you are right, it's a lot faster.

        Now I am left with one problem. In order to implement your suggestion, I only need to issue a single query from my questions table in the database. PLEASE SEE MY RESPONSE TO GMAX FOR AN IDEA.

        Thanks a lot! Jay
      Thanks for the suggestions! But, here's the complete scenario. The questions table have columns: ques_id (PRIMARY KEY AUTO_INCREMENT), subject (VARCHAR), category_id (INT UNSIGNED), option_a (VARCHAR), ... option_d (VARCHAR). Another table (categories), have the following columns: category_id (PRIMARY KEY AUTO_INCREMENT), subject (VARCHAR), category (VARCHAR).

      First, I have to retrieve all DISTINCT subjects from the categories table. (Example, I retrieve the following distinct subjects: Biology, Astronomy,... Mathematics). Since the user configures which subjects he wants to include on his exam, then I really have to do that. So this what I did (SELECT DISTINCT subject FROM categories).

      Next, I have to retrieve also the categories under the selected subject. (Example, I have to retrieve the ff: categories under the subject Mathematics: Algebra, Geometry,... Calculus). So I have to do this (SELECT ques_id FROM questions WHERE subject='Subject1' and Category='Category1' ORDER BY RAND()). This routine would be repeated for each subject.

      After each query described above, but before starting a new query for the next subject, I pushed the random results to an array. On the next subject, the results would be added or pushed to the array previously described.

      This would result into an array having random ques_id's. However, though random the results are from each query, they are clustered according to subjects. Like, for instance, the first ten elements may be about Biology, the next 10 might be of Astronomy, and so forth. This is the reason why I have to randomize again the elements in the resulting array to make sure they are truly random.

      Since the number of subjects could vary... like if a subject Religion was added to the existing subjects (Biology, Astronomy,...), I did not issue a single query like (SELECT ques_id FROM questions WHERE (subject='Subject1' and category='Category1') OR (subject='Subject2' and category='Category2')). If only I could issue a single effective query (which I haven't had a good idea yet), your suggestion of (SELECT ques_id FROM ... ORDER BY RAND() LIMIT 1) could be extremely useful. But with this type of query, would there be a chance that the same question would be repeated? Looks like... or is MySQL smart enough to know and avoid repeated results from RAND() LIMIT 1?

      Your next suggestion of creating a table and dumping the randomized results into it is a good alternative that I have to try.

      To wrap up, is there a good way to issue a single query so that I don't have to randomize the elements of my resulting array? For example, if I am the user, and I selected the subjects: Mathematics (with the following categories: Algebra, Geometry) and Astronomy (with the categories: Solary System, Astronomers), what would be a good way of issue a query? Without doing my lousy way of doing a query for each subject?

      Thanks! JAY

        MySQL may repeat results if you do repeated queries with ORDER BY rand LIMIT 1. But it won't if you ask for all 10 questions at once---ORDER BY rand LIMIT 10.

        You may have to issue one query per subject, but so what? Unless there are dozens of subjects, the speed difference won't be noticeable.

        SELECT ques_id FROM questions q, categories c WHERE q.category_id = c.category_id AND category = 'Mathematics' AND subject in ('Calculus', 'Geometry', 'Algebra') ORDER BY RAND() LIMIT 1
Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?
by b10m (Vicar) on Jan 03, 2004 at 14:11 UTC

    I'm not sure wheter this is a "better" way than the MySQL order by RAND() function, but of course you could have Perl create a "random" number (see rand) and use something like:

    SELECT * FROM table WHERE id=$rand

    I assume that you have a column called 'id' which contains a unique number for and $rand contains the random number generated.

    You might want to benchmark both methods (and possibly more methods to come) and see what works best.

    HTH

    --
    b10m
Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?
by epoptai (Curate) on Jan 03, 2004 at 15:55 UTC
    You can select any number of random rows by limiting the results of your query:
    SELECT question, answer FROM table ORDER BY RAND() LIMIT 1
    It'll be much faster and more efficient than pushing all the results into an array and dumping them into a temporary table. That seems like too much work for the task you described, unless there's a good reason for the temp table.

    --
    perl -MO=Deparse -e"u j t S n a t o e h r , e p l r a h k c r e"

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://318504]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2024-04-25 15:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found