http://qs321.pair.com?node_id=988117


in reply to Get random unique lines from file

radnorr:

Some time ago, someone here at PM showed a pretty cool way to select a random line from a long file. Basically, as you read the file, you store the line if rand gives you a value lower than 1 / (line number). I tried to generalize it here:

#!/usr/bin/perl # # sample_random_lines_from_file.pl <FName> <NumSamples> use 5.10.1; use strict; use warnings; use autodie; my @samples; my $FName = shift // die "missing: <filename> <numsamples>"; my $num = shift // die "missing: <numsamples>"; open my $FH, '<', $FName; while (<$FH>) { if ($num/$. > rand) { my $i = @samples; if ($i > $num) { $i = rand @samples; } #print "slot $i, size=" . scalar(@samples) . ", line $.\n"; $samples[$i]=[ $., $_ ]; } } print "random samples:\n"; print $$_[1] for sort { $$a[0] <=> $$b[0] } @samples;

I haven't tested it extensively: It works, but I haven't convinced myself that it doesn't have a bias yet. Anyway, the little testing I did was first to generate a file with a million lines in it, and run it a few times:

$ perl -e 'print "$_\n" for 1 .. 1000000' >a_million_lines marco@Boink:/Work/Tools/SQL/parser $ perl pm_sample_lines_from_file.pl a_million_lines 10 random samples: 29748 135818 143918 164669 216447 245165 267754 404776 419876 487740 893947 marco@Boink:/Work/Tools/SQL/parser $ perl pm_sample_lines_from_file.pl a_million_lines 10 random samples: 163918 434324 435340 534748 596221 611074 677311 682939 719979 842687 998139

There may be a "bias" in it, in that there may be a preference for one end or the other. I haven't played with it enough to determine whether it has a bias, nor figured out a way to correct it if it does. Anyway, the changes I made to adapt the algorithm are rather simple: Instead of having a probability of 1/(line number) as the indicator whether to keep a line, I use (desired num samples)/(line number) as a flag to store the line. Then I select a random slot in the @samples array to stuff the line into (after we gather enough samples to fill @samples).

I hope you find it useful.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Replies are listed 'Best First'.
Re^2: Get random unique lines from file
by roboticus (Chancellor) on Aug 17, 2012 at 23:42 UTC

    radnorr:

    I had a little more time on my hands, so I tested it a bit. It turns out that there's a slight bias for lines from the start of the file. I haven't done any of the math for it, so I don't know if it'll be easy to fix or not. (The bias is small enough that I'm thinking it might be something as simple as never overwriting the last entry in @sample or something. So it might get an early line and have it stick around until the end.)

    I don't really have the time or inclination to run it down at the moment, but if I were going to do so, I'd do a couple things:

    • Run my test a couple more times with different numbers of samples, to see if I could easily vary the slope of the bias.
    • Make the sample array have *3* slots per entry instead of 2. I'd use the third slot to hold the number of times the slot was overwritten and histogram that as well. Just in case it's something like the fencepost error mentioned earlier.

    Anyway, to test it, I did 1000 runs of 25 samples against the million-line file generated in my previous post:

    $ for J in {0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7, +8,9}; do ./pm_sample_lines_from_file.pl a_million_lines 25 >t.$J; don +e

    While it was cooking, I quickly put together something to crunch the files and check:

    #!/usr/bin/perl + # # pm_sample_histo_test.pl <FNames> # # Make a brief histogram to see if there's any obvious bias in the sam +pler. # Assumes you're sampling from a file containing a million lines, each + of # which contains just the line number. # use 5.10.1; use strict; use warnings; my %H; for my $FName (@ARGV) { print "-- $FName --\n"; open my $FH, '<', $FName or die "can't open $FName\n"; while (<$FH>) { next if /^random/; my $I = int $_/10000; $H{$I}++; } } my $max=0; for (keys %H) { $max = $H{$_} if $H{$_} > $max; } for my $I (0 .. 99) { my $bar = substr('*' x int(50 * ($H{$I}/$max)) . ' 'x50, 0, 50); if ($I > 1 and $I < 98) { my $avg; my $sum = 0; $sum += $H{$_} for $I-2 .. $I+2; $avg = $sum/5; substr($bar,int(50 * ($avg/$max)),1)='+'; } printf "% 3u: (% 3u) %s\n", $I, $H{$I} // 0, $bar; }

    It accumulates the samples from the million-line file into 100 buckets (0..10000 for the first one, 10001..20000 for the second one, etc.). It then plots a histogram, and overlays it with a 5-sample moving average:

    $ ./pm_sample_histo_test.pl t.* 0: (321) ************************************************** 1: (290) ********************************************* 2: (290) *********************************************+ 3: (291) *********************************************+ 4: (281) ******************************************* + 5: (307) *********************************************+* 6: (290) ********************************************+ 7: (283) ********************************************+ 8: (280) ******************************************* + 9: (283) ********************************************+ 10: (309) ********************************************+*** 11: (287) *******************************************+ 12: (267) ***************************************** + 13: (253) *************************************** + 14: (292) *****************************************+*** 15: (257) **************************************** + 16: (278) ******************************************+ 17: (254) *************************************** + 18: (286) ******************************************+* 19: (250) ************************************** + 20: (291) ******************************************+** 21: (277) ******************************************+ 22: (270) ******************************************+ 23: (262) **************************************** + 24: (261) ****************************************+ 25: (263) ***************************************+ 26: (246) ************************************** + 27: (237) ************************************ + 28: (261) ***************************************+ 29: (245) ************************************** + 30: (265) ***************************************+* 31: (267) ***************************************+* 32: (234) ************************************ + 33: (269) ****************************************+ 34: (252) *************************************** + 35: (272) ****************************************+* 36: (259) ****************************************+ 37: (243) ************************************* + 38: (266) ***************************************+* 39: (237) ************************************ + 40: (259) ***************************************+ 41: (256) ***************************************+ 42: (249) ************************************** + 43: (269) ****************************************+ 44: (274) *****************************************+ 45: (266) ****************************************+ 46: (267) ***************************************+* 47: (229) *********************************** + 48: (233) ************************************ + 49: (265) ***************************************+* 50: (247) ************************************** + 51: (281) ****************************************+** 52: (243) ************************************* + 53: (249) ************************************** + 54: (249) ************************************** + 55: (251) **************************************+ 56: (262) *************************************+** 57: (228) *********************************** + 58: (217) ********************************* + 59: (262) *************************************+** 60: (260) **************************************+* 61: (225) *********************************** + 62: (263) ****************************************+ 63: (297) ****************************************+***** 64: (271) ******************************************+ 65: (250) ************************************** + 66: (275) *****************************************+ 67: (267) ***************************************+* 68: (260) ***************************************+ 69: (224) ********************************** + 70: (256) **************************************+ 71: (241) ************************************* + 72: (257) ***************************************+ 73: (256) ***************************************+ 74: (257) ***************************************+ 75: (246) ************************************** + 76: (267) ***************************************+* 77: (267) **************************************+** 78: (232) ************************************ + 79: (239) ************************************* + 80: (252) **************************************+ 81: (269) **************************************+** 82: (234) ************************************ + 83: (239) ************************************* + 84: (240) ************************************* + 85: (259) ***************************************+ 86: (290) ****************************************+**** 87: (249) ************************************** + 88: (256) ***************************************+ 89: (235) ************************************ + 90: (242) ************************************* + 91: (261) **************************************+* 92: (245) **************************************+ 93: (245) ************************************** + 94: (248) **************************************+ 95: (274) ***************************************+** 96: (230) *********************************** + 97: (263) **************************************+* 98: (241) ************************************* 99: (236) ************************************

    As you can see, at the bottom side there are 290-ish lines selected, and at the top end, there are 250-ish lines selected.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re^2: Get random unique lines from file
by BrowserUk (Patriarch) on Aug 17, 2012 at 23:00 UTC

    The problem with this is that it only works with individual lines, but FASTA files contain multi-line records each consisting of a one header line and a variable number of payload lines.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

      BrowserUk:

      Ah, well, I know jack about FASTA files, so I didn't consider that. Of course, by changing the reader to accumulate records instead of lines, it could be adapted. Though since there are already a couple working examples from you and Marshall, and since mine has a bias in it, there's no real reason to do so.

      I know that *you* know how to do the changes, but if someone tripping across this node in the future wants to do it, you can do so something (untested!) like this:

      my @record; while (<$FH>) { if (/start of record marker/) { ++$cnt_recs; if ($num/$cnt_recs > rand) { my $i=@samples; if ($i > $num) { $i = rand @samples; } $samples[$i]=[$cnt_recs, [@record]]; } } else { # Accumulate record push @record, $_; } }

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

        I don't think that this is the best way...
        BrowserUk and I both used the core module: List::Util::shuffle;

        He understood the FASTA format better than I did and that is fine given that the format of the OP's question was hard to "decode".

        The main point is that is that this core shuffle() function works very well, is very fast (a core function that is implemented in 'C') and who's interface is easy to understand. I recommend using it rather than trying to "roll your own".

        Oh, BTW, "Core Function" means that this is available on all Perl systems as part of the language - no "extra module installation" is required. .... Well I don't know exactly about "all", but I figure since Perl 5.6 (for more than decade).