Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Re^2: Get random unique lines from file

by roboticus (Chancellor)
on Aug 17, 2012 at 23:42 UTC ( [id://988126]=note: print w/replies, xml ) Need Help??


in reply to Re: Get random unique lines from file
in thread Get random unique lines from file

radnorr:

I had a little more time on my hands, so I tested it a bit. It turns out that there's a slight bias for lines from the start of the file. I haven't done any of the math for it, so I don't know if it'll be easy to fix or not. (The bias is small enough that I'm thinking it might be something as simple as never overwriting the last entry in @sample or something. So it might get an early line and have it stick around until the end.)

I don't really have the time or inclination to run it down at the moment, but if I were going to do so, I'd do a couple things:

  • Run my test a couple more times with different numbers of samples, to see if I could easily vary the slope of the bias.
  • Make the sample array have *3* slots per entry instead of 2. I'd use the third slot to hold the number of times the slot was overwritten and histogram that as well. Just in case it's something like the fencepost error mentioned earlier.

Anyway, to test it, I did 1000 runs of 25 samples against the million-line file generated in my previous post:

$ for J in {0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7, +8,9}; do ./pm_sample_lines_from_file.pl a_million_lines 25 >t.$J; don +e

While it was cooking, I quickly put together something to crunch the files and check:

#!/usr/bin/perl + # # pm_sample_histo_test.pl <FNames> # # Make a brief histogram to see if there's any obvious bias in the sam +pler. # Assumes you're sampling from a file containing a million lines, each + of # which contains just the line number. # use 5.10.1; use strict; use warnings; my %H; for my $FName (@ARGV) { print "-- $FName --\n"; open my $FH, '<', $FName or die "can't open $FName\n"; while (<$FH>) { next if /^random/; my $I = int $_/10000; $H{$I}++; } } my $max=0; for (keys %H) { $max = $H{$_} if $H{$_} > $max; } for my $I (0 .. 99) { my $bar = substr('*' x int(50 * ($H{$I}/$max)) . ' 'x50, 0, 50); if ($I > 1 and $I < 98) { my $avg; my $sum = 0; $sum += $H{$_} for $I-2 .. $I+2; $avg = $sum/5; substr($bar,int(50 * ($avg/$max)),1)='+'; } printf "% 3u: (% 3u) %s\n", $I, $H{$I} // 0, $bar; }

It accumulates the samples from the million-line file into 100 buckets (0..10000 for the first one, 10001..20000 for the second one, etc.). It then plots a histogram, and overlays it with a 5-sample moving average:

$ ./pm_sample_histo_test.pl t.* 0: (321) ************************************************** 1: (290) ********************************************* 2: (290) *********************************************+ 3: (291) *********************************************+ 4: (281) ******************************************* + 5: (307) *********************************************+* 6: (290) ********************************************+ 7: (283) ********************************************+ 8: (280) ******************************************* + 9: (283) ********************************************+ 10: (309) ********************************************+*** 11: (287) *******************************************+ 12: (267) ***************************************** + 13: (253) *************************************** + 14: (292) *****************************************+*** 15: (257) **************************************** + 16: (278) ******************************************+ 17: (254) *************************************** + 18: (286) ******************************************+* 19: (250) ************************************** + 20: (291) ******************************************+** 21: (277) ******************************************+ 22: (270) ******************************************+ 23: (262) **************************************** + 24: (261) ****************************************+ 25: (263) ***************************************+ 26: (246) ************************************** + 27: (237) ************************************ + 28: (261) ***************************************+ 29: (245) ************************************** + 30: (265) ***************************************+* 31: (267) ***************************************+* 32: (234) ************************************ + 33: (269) ****************************************+ 34: (252) *************************************** + 35: (272) ****************************************+* 36: (259) ****************************************+ 37: (243) ************************************* + 38: (266) ***************************************+* 39: (237) ************************************ + 40: (259) ***************************************+ 41: (256) ***************************************+ 42: (249) ************************************** + 43: (269) ****************************************+ 44: (274) *****************************************+ 45: (266) ****************************************+ 46: (267) ***************************************+* 47: (229) *********************************** + 48: (233) ************************************ + 49: (265) ***************************************+* 50: (247) ************************************** + 51: (281) ****************************************+** 52: (243) ************************************* + 53: (249) ************************************** + 54: (249) ************************************** + 55: (251) **************************************+ 56: (262) *************************************+** 57: (228) *********************************** + 58: (217) ********************************* + 59: (262) *************************************+** 60: (260) **************************************+* 61: (225) *********************************** + 62: (263) ****************************************+ 63: (297) ****************************************+***** 64: (271) ******************************************+ 65: (250) ************************************** + 66: (275) *****************************************+ 67: (267) ***************************************+* 68: (260) ***************************************+ 69: (224) ********************************** + 70: (256) **************************************+ 71: (241) ************************************* + 72: (257) ***************************************+ 73: (256) ***************************************+ 74: (257) ***************************************+ 75: (246) ************************************** + 76: (267) ***************************************+* 77: (267) **************************************+** 78: (232) ************************************ + 79: (239) ************************************* + 80: (252) **************************************+ 81: (269) **************************************+** 82: (234) ************************************ + 83: (239) ************************************* + 84: (240) ************************************* + 85: (259) ***************************************+ 86: (290) ****************************************+**** 87: (249) ************************************** + 88: (256) ***************************************+ 89: (235) ************************************ + 90: (242) ************************************* + 91: (261) **************************************+* 92: (245) **************************************+ 93: (245) ************************************** + 94: (248) **************************************+ 95: (274) ***************************************+** 96: (230) *********************************** + 97: (263) **************************************+* 98: (241) ************************************* 99: (236) ************************************

As you can see, at the bottom side there are 290-ish lines selected, and at the top end, there are 250-ish lines selected.

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://988126]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2024-04-25 23:32 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found