Re: Get random unique lines from file

in reply to Get random unique lines from file

Some time ago, someone here at PM showed a pretty cool way to select a random line from a long file. Basically, as you read the file, you store the line if rand gives you a value lower than 1 / (line number). I tried to generalize it here:
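For reference, the single-line version described above might look roughly like this. It is only a minimal sketch: pick_random_line is a made-up helper name, and it keeps its own counter instead of using $. so that it can be tried on an in-memory file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pick one line uniformly at random from a filehandle in a single pass.
# pick_random_line is a hypothetical helper name, not taken from the original post.
sub pick_random_line {
    my ($fh) = @_;
    my $pick;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++;
        # Line number $count replaces the current pick with probability 1/$count.
        # The first line is always kept, because rand() is always below 1.
        $pick = $line if rand() < 1 / $count;
    }
    return $pick;    # undef if the file was empty
}

# Usage: read from an in-memory file so the sketch is self-contained.
my $text = join '', map { "line $_\n" } 1 .. 5;
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
print pick_random_line($fh);
close $fh;
```

Each line ends up as the final pick with the same probability, since a line kept at position i survives every later line j with probability (j-1)/j.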

#!/usr/bin/perl                             
#                                           
# Usage: pm_sample_lines_from_file.pl <FName> <NumSamples>
use 5.10.1;                                 
use strict;                                 
use warnings;                               
use autodie;                                
                                            
my @samples;                                
                                            
my $FName = shift // die "missing: <filename> <numsamples>"; 
my $num = shift // die "missing: <numsamples>";              
                                            
open my $FH, '<', $FName;                   
while (<$FH>) {                             
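    # Keep this line with probability $num/$. (always true for the first $num lines)
    if ($num/$. > rand) {
        my $i = @samples;
        # >= keeps @samples at exactly $num lines; a plain > lets it grow to $num + 1
        if ($i >= $num) { $i = int rand @samples; }
        #print "slot $i, size=" . scalar(@samples) . ", line $.\n";
        $samples[$i]=[ $., $_ ];
    }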
}                                           
                                            
print "random samples:\n";                  
print $_->[1] for sort { $a->[0] <=> $b->[0] } @samples;

I haven't tested it extensively: it works, but I haven't yet convinced myself that it has no bias. The little testing I did was to generate a file with a million lines and run the script against it a few times:

$ perl -e 'print "$_\n" for 1 .. 1000000' >a_million_lines

marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
29748
135818
143918
164669
216447
245165
267754
404776
419876
487740
893947

marco@Boink:/Work/Tools/SQL/parser
$ perl pm_sample_lines_from_file.pl a_million_lines 10
random samples:
163918
434324
435340
534748
596221
611074
677311
682939
719979
842687
998139

By "bias" I mean a preference for one end of the file or the other; I haven't played with it enough to tell whether it has one, nor figured out how to correct it if it does. Note that both runs above printed 11 lines for a request of 10: a slot test of $i > $num lets @samples grow to $num + 1 entries before it starts overwriting, so the test needs to be $i >= $num. The changes I made to adapt the algorithm are rather simple: instead of keeping a line with probability 1/(line number), I keep it with probability (desired number of samples)/(line number). Until @samples is full, a kept line is simply appended; after that, it overwrites a randomly chosen slot.
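One way to look for such a bias is to run the same selection rule many times over a small range of line numbers and count how often each one is chosen. The sketch below does that. It is an illustration rather than a proof: sample_line_numbers is a made-up helper, the sizes are arbitrary, and it assumes the slot test is $i >= $num so that the array holds exactly the requested number of entries. If the selection is uniform, every line should turn up in about k/n of the trials.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Draw $k of the line numbers 1 .. $n: keep line $line with probability
# $k/$line, overwriting a random slot once the array is full.
# sample_line_numbers is a hypothetical helper, not part of the original script.
sub sample_line_numbers {
    my ($n, $k) = @_;
    my @samples;
    for my $line (1 .. $n) {
        next unless $k / $line > rand();
        my $i = @samples;                      # append while not yet full
        $i = int rand @samples if $i >= $k;    # otherwise overwrite a random slot
        $samples[$i] = $line;
    }
    return @samples;
}

my ($n, $k, $trials) = (20, 5, 100_000);    # arbitrary illustrative sizes
my %hits;
for (1 .. $trials) {
    $hits{$_}++ for sample_line_numbers($n, $k);
}

# With no bias, every line should be picked in about $k/$n of the trials.
printf "expected fraction: %.3f\n", $k / $n;
printf "line %2d: %.3f\n", $_, ($hits{$_} // 0) / $trials for 1 .. $n;
```

A column of fractions that all sit close to the expected value suggests no preference for either end of the file; a steady drift from the first line to the last would point to a bias.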

I hope you find it useful.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
