Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much

Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?

by gmax (Abbot)
on Jan 03, 2004 at 19:54 UTC ( #318531=note: print w/replies, xml ) Need Help??

in reply to Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?

There are a few ways of achieving what you want.

  • The one you are using (SELECT ... ORDER BY RAND() LIMIT 1) is easy and immediate. As you said, efficiency could be very poor when your table size increases.

    However, you can improve the execution speed if you only select the primary key column in your query.

    Note that this code can be used also if you want to get more than one random question at once.

    my $limit = 1; my $query = qq{SELECT id FROM table ORDER BY RAND() LIMIT $limit}; my $ids = $dbh->selectcol_arrayref($query) or die "execution error ($DBI::errstr\n"; $query = qq{SELECT question, answer FROM table WHERE id in (} . (join ",", @$ids) . " ) " ;

    Selecting only the primary key will reduce the amount of data that the DBMS has to process. Therefore, selecting and sorting will be faster.

  • A very fast alternative is available when your data doesn't change so often. If your data changes, say, once a day but you are querying around the clock, then you can make an auxiliary table each time you update or insert something.

    mysql> create table if not exists support_tbl -> (cnt int not null auto_increment primary key, -> id int not null); mysql> truncate support_tbl; mysql> insert into support_tbl (id) select id from table order by rand +();

    Here you have a supporting table with primary keys in random order. Yopu can get a pageload of IDs very easily:

    my $cnt =1; my $ids = $dbh->selectcol_arrayref(qq{select id from support_tbl where cnt = ?}, undef, $cnt);

    For your next request, you will increase $cnt according to your needs.

  • Another possibility is calculating the random value in Perl.

    # make sure our random value is not higher than # any value in the primary key my ($max) = @{$dbh->selectcol_arrayref(qq{select max(id) from table})} +; my $cnt = rand($max); my $query = qq{SELECT question, answer FROM table WHERE id = $cnt };

    WARNING! Although this solution is fast, there is a catch. It will only work if your data does not have any gaps in your primary key column, i.e. if you deleted any row this method may not work as expected.


 _  _ _  _  
(_|| | |(_|><
  • Comment on Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?
  • Select or Download Code

Replies are listed 'Best First'.
Re: Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?
by sgifford (Prior) on Jan 04, 2004 at 08:17 UTC

    I used the third technique in a recent project, but instead of using the primary key (which was a non-sequential ID number), I added a rownum column which gave each row a sequential number. I then selected a random value between 1 and the number of rows.

    The reason this is much faster than the ORDER BY rand() technique is that (AFAIK) in order to resolve ORDER BY rand(), MySQL has to go through each row, generate a random number, then see which one is the lowest. If the table is very large, that takes a long time.

    The problem with using a row number is that if the data changes, you can end up gaps with in numbering. If the data is generated periodically, there's no problem; just regenerate the row numbers with the rest of the data. Periodically, you can regenerate the row numbers to fill in gaps. If there's just a few missing numbers scattered here and there and you don't need great randomness, it's no big deal; instead of using WHERE id = $rand you'd use WHERE id >= $rand LIMIT 1. If you have a lot of changes, though, this technique will work poorly.

    Something that might work a little better with a large number of changes would be to randomly assign each record a number between, say, 1 and 100. To choose a random record, then, you would first pick a number between 1 and 100, then you would get a record randomly from the set of rows assigned that number with ORDER BY rand(). It would still have to evaluate multiple records, but on average it would only be 1/100 as many records as using ORDER BY rand() on the whole table.

      That's a good suggestion! I believe you are right, it's a lot faster.

      Now I am left with one problem. In order to implement your suggestion, I only need to issue a single query from my questions table in the database. PLEASE SEE MY RESPONSE TO GMAX FOR AN IDEA.

      Thanks a lot! Jay
        To make this technique work with questions grouped into subjects, you'd want to assign a per-subject question number, then for each subject pick a random number between 1 and the number of questions in that subject to find the record.
Re: Re: Is there a good way of retrieving entries randomly from a database (MySQL perhaps)?
by soon_j (Scribe) on Jan 04, 2004 at 08:41 UTC
    Thanks for the suggestions! But, here's the complete scenario. The questions table have columns: ques_id (PRIMARY KEY AUTO_INCREMENT), subject (VARCHAR), category_id (INT UNSIGNED), option_a (VARCHAR), ... option_d (VARCHAR). Another table (categories), have the following columns: category_id (PRIMARY KEY AUTO_INCREMENT), subject (VARCHAR), category (VARCHAR).

    First, I have to retrieve all DISTINCT subjects from the categories table. (Example, I retrieve the following distinct subjects: Biology, Astronomy,... Mathematics). Since the user configures which subjects he wants to include on his exam, then I really have to do that. So this what I did (SELECT DISTINCT subject FROM categories).

    Next, I have to retrieve also the categories under the selected subject. (Example, I have to retrieve the ff: categories under the subject Mathematics: Algebra, Geometry,... Calculus). So I have to do this (SELECT ques_id FROM questions WHERE subject='Subject1' and Category='Category1' ORDER BY RAND()). This routine would be repeated for each subject.

    After each query described above, but before starting a new query for the next subject, I pushed the random results to an array. On the next subject, the results would be added or pushed to the array previously described.

    This would result into an array having random ques_id's. However, though random the results are from each query, they are clustered according to subjects. Like, for instance, the first ten elements may be about Biology, the next 10 might be of Astronomy, and so forth. This is the reason why I have to randomize again the elements in the resulting array to make sure they are truly random.

    Since the number of subjects could vary... like if a subject Religion was added to the existing subjects (Biology, Astronomy,...), I did not issue a single query like (SELECT ques_id FROM questions WHERE (subject='Subject1' and category='Category1') OR (subject='Subject2' and category='Category2')). If only I could issue a single effective query (which I haven't had a good idea yet), your suggestion of (SELECT ques_id FROM ... ORDER BY RAND() LIMIT 1) could be extremely useful. But with this type of query, would there be a chance that the same question would be repeated? Looks like... or is MySQL smart enough to know and avoid repeated results from RAND() LIMIT 1?

    Your next suggestion of creating a table and dumping the randomized results into it is a good alternative that I have to try.

    To wrap up, is there a good way to issue a single query so that I don't have to randomize the elements of my resulting array? For example, if I am the user, and I selected the subjects: Mathematics (with the following categories: Algebra, Geometry) and Astronomy (with the categories: Solary System, Astronomers), what would be a good way of issue a query? Without doing my lousy way of doing a query for each subject?

    Thanks! JAY

      MySQL may repeat results if you do repeated queries with ORDER BY rand LIMIT 1. But it won't if you ask for all 10 questions at once---ORDER BY rand LIMIT 10.

      You may have to issue one query per subject, but so what? Unless there are dozens of subjects, the speed difference won't be noticeable.

      SELECT ques_id FROM questions q, categories c WHERE q.category_id = c.category_id AND category = 'Mathematics' AND subject in ('Calculus', 'Geometry', 'Algebra') ORDER BY RAND() LIMIT 1
        Just re-writing it, with a bit of correction (not a syntax type):

        SELECT ques_id
        FROM questions q, categories c
        WHERE q.category_id = c.category_id
        AND subject = 'Mathematics'
        AND category in ('Calculus', 'Geometry', 'Algebra')
        ORDER BY RAND()
        LIMIT 1

        The above query is for the subject 'Mathematics' only. How about for the other subjects? I am try to avoid repeating the above query for the next subject like Astronomy perhaps. I am try to consolidate all these tiny queries for each subject into a single and systematic query.

        Thanks! Jay

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://318531]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2022-12-10 05:28 GMT
Find Nodes?
    Voting Booth?

    No recent polls found