Extract random records from text file

SuperCruncher has asked for the wisdom of the Perl Monks concerning the following question:

I am currently developing a generic web-based multiple choice testing system backend for a road safety project I'm currently working on. The system, Multiple Choice Assessment System (MCAS), has already given me a lot of problems (but BTW thanks to arturo, clemburg, mattr and Chmrr who have already helped me out a lot!).

All the questions are stored in a simple text file. A sample question looks like this:

question=When driving towards a bright setting sun, glare can be reduc
+ed by
option1=closing one eye
option2=dipping in the interior mirror
option3=wearing dark glasses
option4=looking sideways
answer=4
answertext=Dark glasses are often used as a fashion item but they do h
+ave their practical uses. Low sun in the early morning or evening can
+ dazzle and distract. Lessen the risk by reducing the glare. Wear dar
+k glasses if they help you to see better. 
category1=car
category2=vechicle handling
[download]

Questions in the file are delimited by \n===\n and each question may have a variable number of lines (as number of options can vary and some questions have 'visual aids' etc.)

The idea is that each test has n questions (n is specified in a config file). A script is run on the main questions file that produces a preset number of 'test files'. When the system is eventually ready to 'go live', a random 'test file' is selected and converted to HTML to be displayed to the user.

What I'm having trouble with now is my make_tests.pl script that, as the name implies :-) creates the 'test files'. It is meant to randomly select n questions from the main questions file and save them as a 'test file'. Thanks partially to the above monks and partially to 'The Perl Cookbook', I know have the following code (that is intended to return a random list of n questions):

sub get_std_test { 
  die +(caller(0))[3].' cannot be called in void context' 
    if not defined wantarray; 

  $/ = "===\n";   # set our delimiter  
  
  my %config = mcas::get_config(); # returns a hash of all config vars
  my $count = 0; 
  my @questions; 

  open QUESTIONS, '<questions' or die "Cannot open QUESTIONS file: $!"
+; 
  while (<QUESTIONS>) {
    if (rand($.) < 1) { 
      if ($count == $config{num_questions_per_test}) { 
        last; 
      } else { 
        push @questions, $_; 
        $count++
      }
    }
  } 
  close QUESTIONS;
  
  die "\$count = $count, < num_questions_per_test config var" 
    if ($count < $config{num_questions_per_test}); 

  return @questions; 
}
[download]

The function currently seems to return only lines, not whole questions. I think clemburg in particular might have warned me about this, and said that I needed some kind of 'read_record' routine but I've got no idea how to go about this (in particular, how to 'maintain state' or maintain file position between different invocations of the function).

If anyone can offer any advice (or even code!), I'd greatly appreciate it.

Comment on Extract random records from text file Select or Download Code

Replies are listed 'Best First'.
Re: Extract random records from text file by runrig (Abbot) on Oct 02, 2001 at 21:33 UTC
For a solution which does not suck in the whole file to memory (and I don't see any of those yet), here's a variation on How do I select a random line from a file?: `my $q_file = "tmp.txt"; my $n = 10; my @ques = get_questions($q_file, $n); print for @ques; sub get_questions { local $/="===\n"; local $.; # Is this necessary? local @ARGV = (shift); # If you want 'open ... or die' behavior local $SIG{__WARN__} = sub { die shift }; my $num = shift; my @questions; while (<>) { push(@questions, $_), next if $. <= $num; my $i = rand($.); $questions[$i] = $_ if $i < $num; } @questions; }` [download]	[reply] [d/l]
Re (tilly) 2 (unrandomness): Extract random records from text file by tilly (Archbishop) on Oct 02, 2001 at 22:15 UTC
I like this solution best because it does something honestly new and unusual. However there is one caveat. While it does indeed avoid reading the file into memory, and the actual questions chosen are randomized, the first 10 questions in the list will always appear "in position" if they appear. For perfectly random questions, therefore, it would be best to throw in a shuffle at the end.	[reply]
Re: Re (tilly) 2 (unrandomness): Extract random records from text file by runrig (Abbot) on Oct 02, 2001 at 23:34 UTC
Or you can just shuffle them from the start: `my $num = shift; my @init = (0..$num-1); my @questions; while (<>) { $questions[splice @init, rand(@init), 1] = $_, next if $. <= $num; my $i = rand($.); $questions[$i] = $_ if $i < $num; }` [download]	[reply] [d/l]
Re (tilly) 4 (unrandomness): Extract random records from text file by tilly (Archbishop) on Oct 03, 2001 at 21:30 UTC
Re: Extract random records from text file by merlyn (Sage) on Oct 02, 2001 at 20:53 UTC
quick untested code: `my $file = do { local $/; <FILE> }; # slurp my @questions = $file =~ /(^question.\n(?:(?!question).\n))/gm; for (1..10) { # 10 questions at random my ($one) = splice @questions, rand @questions, 1; my %param = $one =~ /^(\w+)=(.)/gm; print $param{question}, "\n"; print "$_: ", $param{"option$_"}, "\n" for 1..4; ... etc ... }` [download] -- Randal L. Schwartz, Perl hacker	[reply] [d/l]
Re: Extract random records from text file by jeroenes (Priest) on Oct 02, 2001 at 20:28 UTC
Look again at rand and perlvar. `use POSIX qw/floor/; local $/= '==='; my @allquestions = <QUESTIONS>; my @questions; for (1..$maxcount){ my $idx = floor rand @allquestions; push @questions, splice @allquestions, $idx, 1; } return @questions;` [download] Than you'll have to solve your delimiter problem. Dunno what it is, run some checks on that. (We're discussing that in CB right now) And, of course, use strict and warnings. Jeroen "We are not alone"(FZ) Update I added the local thing... too important to miss out... read Why or why not? - local undef $/ and Use strict warnings and diagnostics or die eg. Update 2 merlyn made me realize (again) that you don't need to floor the stuff: `local $/= '==='; my @allquestions = <QUESTIONS>; my @questions; push @questions, splice @allquestions, rand @allquestions, 1 for (1..$ +maxcount) return @questions;` [download]	[reply] [d/l] [select]
Re: Extract random records from text file by demerphq (Chancellor) on Oct 02, 2001 at 20:47 UTC
This works for me: local $,=$/."**".$/; local $\=$/; sub rand_questions { my $total=shift; local $/ = "===\n"; # set our delimiter my @data=<DATA>; #fisher yates shuffle my $i; for ($i = @data; --$i; ) { my $j = int rand ($i+1); # next if $i == $j; # See tillys reply @data[$i,$j] = @data[$j,$i]; } my @questions=@data[0..$total-1]; return @questions; } print rand_questions(2); __DATA__ question=When driving towards a bright setting sun, glare can be reduc +ed by option1=closing one eye option2=dipping in the interior mirror option3=wearing dark glasses option4=looking sideways answer=4 answertext=Dark glasses are often used as a fashion item but they do h +ave their practical uses. Low sun in the early morning or evening can + dazzle and distract. Lessen the risk by reducing the glare. Wear dar +k glasses if they help you to see better. category1=car category2=vechicle handling === question=2When driving towards a bright setting sun, glare can be redu +ced by option1=closing one eye option2=dipping in the interior mirror option3=wearing dark glasses option4=looking sideways answer=4 answertext=Dark glasses are often used as a fashion item but they do h +ave their practical uses. Low sun in the early morning or evening can + dazzle and distract. Lessen the risk by reducing the glare. Wear dar +k glasses if they help you to see better. category1=car category2=vechicle handling === question=3When driving towards a bright setting sun, glare can be redu +ced by option1=closing one eye option2=dipping in the interior mirror option3=wearing dark glasses option4=looking sideways answer=4 answertext=Dark glasses are often used as a fashion item but they do h +ave their practical uses. Low sun in the early morning or evening can + dazzle and distract. Lessen the risk by reducing the glare. Wear dar +k glasses if they help you to see better. category1=car category2=vechicle handling === [download] HTH Update** I left out various errorchecking and the like, but you can add that. Yves -- You are not ready to use symrefs unless you already know why they are bad. -- tadmc (CLPM)	[reply] [d/l]
Re (tilly) 2 (small optimization): Extract random records from text file by tilly (Archbishop) on Oct 02, 2001 at 22:26 UTC
As noted in Fisher-Yates Shuffle, the check for whether the two elements are equal is inefficient. The cost of adding the check scales linearly with the number of elements. The number of saved swaps scales logarithmically. The only time it is important to have the check is when your swap will run into trouble if the two elements are the same. This is true in C if you are using xor to perform the swap in place. It is not true in Perl. Therefore in Perl it is both faster and clearer to drop the check.	[reply]
Re: Extract random records from text file by suaveant (Parson) on Oct 02, 2001 at 20:39 UTC
I would do something more like this (untested, but should work) sub get_std_test { die +(caller(0))[3].' cannot be called in void context' if not defined wantarray; local $/; # set our delimiter to nothing, slurp file my %config = mcas::get_config(); # returns a hash of all config vars my @questions; open QUESTIONS, 'questions' or die "Cannot open QUESTIONS file: $!"; my @questions = split "===\n", <QUESTIONS>; close QUESTIONS; while(@questions > $config{num_questions_per_test}) { splice @questions, int(rand(@questions)), 1; } die "\$count = $count, < num_questions_per_test config var" if (@questions < $config{num_questions_per_test}); return @questions; } [download] I set $/ to undef, and read the whole thing in, splitting it on your delimeter into @questions. I then randomly remove (splice) questions out of the array until it contains the number of questions you want... - Ant - Some of my best work - Fish Dinner	[reply] [d/l]
Re: Extract random records from text file by George_Sherston (Vicar) on Oct 02, 2001 at 20:27 UTC
in your `while` loop, `$_` is only the current line of the file in `QUESTIONS` - the globbing operator doesn't know that you are using a different delimiter than just a \n. Insta-update: sorry to insult your intelligence - reviewing the following nodes I finally understood what the input record separator does... which was nice for me but not much help to you. So I now realise that my solution may be a good solution, but it's not a solution to the real problem. But here, for the benefit of a laughing posterity, it is: Unless memory is a constraint or `questions` is a really huge file, I think the simplest solution would be to read the whole file into a scalar, and then split it into an array of questions with `@questions = split (/===\n/, $scalar)`. Then you can run your loop, but instead of `while (<QUESTIONS>) {` [download] you would use `for (@questions) {` [download] � George Sherston	[reply] [d/l] [select]
Re: Extract random records from text file by tommyw (Hermit) on Oct 02, 2001 at 20:27 UTC
Your question selector is broken, even if you arrange for one question per line. Question 1 will be pushed into the array. Question 2 has a 1/2 chance of being pushed. Question 3 has a 1/3 chance of being pushed. If you (for example) only want 2 questions, then you've now got a 5/6 chance of having selected them. So question 4 will only be tested 1/6 of the time, and then get pushed 1/4 of that: a total selectability of 1/24, not the 1/4 you intended. if you really want to get all the questions in a single pass through the file, then you need to determine whether to keep the question (using rand), and then push the relevant details into a stack. At the end pop the last N elements off and use them. That's esentially what the trick in the Cookbook does, but it uses a simple stack, of only one entry (aka a variable ;)	[reply]
Re: Extract random records from text file by cfreak (Chaplain) on Oct 02, 2001 at 20:37 UTC
It looks like you are reading in the questions file one line at a time rather than parsing them by the === separator. I would do something more like this: `my @lines = <QUESTIONS>; # The entire file in an array my @questions = (); # A new array for questions my $tmp = ""; # A tmp variable to capture stuff in my $count = 0; foreach my $line(@lines) { if($line =~ /^===$/) { push(@questions,$tmp) unless $count == 0; $tmp = ""; next; } $tmp .= $line; } # and then to get the last one push(@questions,$tmp) if $tmp;` [download] Then to get your random stuff `my @rand_questions = (); foreach (0 .. $num_questions_per_test) { my $rand = int(rand(scalar(@questions) - 1); push(@rand_questions,$questions[$rand]); }` [download] That would put all the random questions in the array rand_questions. I'm sure there are ways to make it simpler I tend to code the long way so I can understand it when I look at it 6 months later :) hope that helps.	[reply] [d/l] [select]


Pathologically Eclectic Rubbish Lister
	PerlMonks