http://qs321.pair.com?node_id=988096

radnorr has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a bit of a snag with my Perl code. I am trying to take a random, unique sampling of a file containing >100,000 sequences, but the way the data is laid out, I can't seem to figure out a good way to write the code.

What I have so far:

#!/usr/bin/perl -w
# Use your own path
use strict;   # always use strict

my $file = "josh_vamps_seqs_ALL.fa";
open (FILE, $file) or die "$0: $!";   # printing errors is a Good Thing(tm)
my @file = <FILE>;                    # Store all the lines in an array
close FILE;

open OUT, ">rarefaction.fa" or die "$0: $!";   # Caution: will overwrite any existing files

my $i;
for ($i = 0; $i < 4266; $i++) {   # Start sampling loop
    my $rand = int(rand($#file));
    # rand gives you a random number between 0 and the number of lines
    # remaining in the array; note that this number is by necessity
    # dynamic, it will drop 1000, 999, 998, etc.
    my $sample = splice @file, $rand, 1;   # Cut out the $rand-th line; offset 1 means just one line
    print OUT $sample;
}   # end of for loop

What it gives me:

CTTTTCTTCGGACTACTTACAAGGTGTTGCATGGTCGTC
>FW4WBAJ01DVAX5.ICM_PML_Bv6.PML_43_2003_06_09
CGAGTCAACGCGCAGAACCTTACCAACACTTGACATGTTCGTCGCGACTCTAAGAGATTA
TCTCTATGCGCAACGCGAAAACCTTACCTGGCCTTGACATGCATCTCTAAGCGTGTGAAA
>FMS0R7002J2YH1.ICM_CAM_Bv6.CAM_0011_2000_03_26
TGGTGCCTTCGGGAACGCAGTGACAGGTGATGCATGG
AAACCCTCAGAGACTTCGGTTAATGACATGTTTACAGGTGATGCATGGCCGTCG
>E6SXMJY02I00IR.ICM_BMO_Bv6.BMO_0005_2007_09_22
TTCGGTTCGGCCGGACGAAACACAGGTGT
TAGTGCGACGCGAAGAACCTTACCAGGGCTTAAATGTAGTGGGACAGGTCTAGAGATAGA
GGGTGCCCTTCGGGGAATCTAGTGAGAGGTGTTGCATGGCCGTCG
GTGAGCAACGCGCAGAACCTTACCAACCCTTGACATCCTGTGCTACTACCAGAGATGGTA
TACATCTACGCGAAGAACCTTATCTACACTTGACATACAGAGAACTTACCAGAGATGGTT
TGGTGCCTTCGGGAATCTAGTGACAGGTGATGCATGGCTGTCG
CACACCAACGCGAAAAACCTTACCAACACTTGACATGTTCGTCGCGACTCTAAGAGATTA
TTCGGTTCGGCCGGACGAAACACAGGTGTTGCATGGCTGTC

What I actually want:

FW4WBAJ01DVAX5.ICM_PML_Bv6.PML_43_2003_06_09
FMS0R7002J2YH1.ICM_CAM_Bv6.CAM_0011_2000_03_26
E6SXMJY02I00IR.ICM_BMO_Bv6.BMO_0005_2007_09_22

What the input file looks like:

>FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACCTTACCAGGTCCTGACTTCCTGACTATGGTTATTAGAAATAA
TTTCCTTCAGTTCGGCTGGGTCAGTGACAGGTGATGCATGGCCGTC
>FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTAACCGATGAACCTTACCTACACTTGACATGCAGAGAACTTTCCAGAGATGGAT
TGGTGCCTTCGGGAACTCTGACACAGGTGATGCATCGCCGTC
>FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTACGCGAAGAACCTTACCTACACTTGACATACAGAGAACTTACCAGAGATGGTT
TGGTGCCTTCGGGAACTCTGATACAGGTGATGCATGGCTGTC
>FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACCTTACCAACCCTTGACATCCAGAGAATTTTCTAGAGATAGAT
TTGTGCCTTCGGGAACTCTGTGACAGGTGATGCATGGCTGTC

I don't know if there is a way for Perl's random function to recognize a specific piece of the input, say the ">", and then print the line that follows it. Any pointers would be greatly appreciated.

Re: Get random unique lines from file
by kennethk (Abbot) on Aug 17, 2012 at 22:00 UTC
    Your post would be greatly served by improved formatting. Specifically, by wrapping your code, input, and output in <code> tags, your code and samples wouldn't get mangled, and helpful monks like myself could help you with less effort. See Markup in the Monastery.

    Rather than reformatting after the fact, I would make it so that your array of lines (@file) holds one complete FASTA record per element as soon as it is read. I would accomplish this by slurping the file and then manually splitting where you want the breaks. Perhaps something like:

    my @file = do {
        local $/;          # slurp mode: read everything at once
        $_ = <FILE>;
        split /(?=>)/;     # split before each '>' without consuming it
    };

    As well, rather than randomly splicing, I would think you'd get similar performance and better clarity by using the shuffle function from the core module List::Util. Then you could just print the records off in the new order, or shift/pop off values as you need them.
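    A minimal, untested sketch of that combination, reusing the OP's filename and 4266-record sample size ($want and $fh are just illustrative names):

    use strict;
    use warnings;
    use List::Util qw(shuffle);

    my $file = "josh_vamps_seqs_ALL.fa";
    my $want = 4266;

    open my $fh, '<', $file or die "$file: $!";
    my @file = do {
        local $/;                 # slurp the whole file at once
        split /(?=>)/, <$fh>;     # then break it apart before each '>'
    };
    close $fh;

    # Shuffle once, then keep the first $want records.
    print for ( shuffle @file )[ 0 .. $want - 1 ];

    Redirect the output (e.g. into rarefaction.fa) if you want it in a file rather than on STDOUT.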

    Be aware that these approaches are likely sub-optimal for very large files.



Re: Get random unique lines from file
by BrowserUk (Patriarch) on Aug 17, 2012 at 22:28 UTC

    This script takes the name of the file and the number of sequences you want and outputs that number of randomly selected sequences.

    It is simple, fast and should handle any size of input file with minimal memory usage:

    #! perl -slw
    use strict;
    use List::Util qw[ shuffle ];

    $/ = '>';    # read the file one FASTA record at a time

    open FASTA, '<:raw', $ARGV[0] or die $!;

    my @seqPosns;
    push @seqPosns, tell( FASTA ) while <FASTA>;

    @seqPosns = shuffle @seqPosns;

    for ( @seqPosns[ 0 .. $ARGV[ 1 ] // 10 ] ) {
        seek FASTA, $_, 0;
        my $seq = <FASTA>;
        chomp( $seq );   # remove the trailing '>' record separator
        chop( $seq );    # remove the final newline
        print '>', $seq;
    }

    close FASTA;
    __END__
    C:\test>988096.pl C:/dell/test/LCS/bioMan.fasta 2
    >af418682
    TTCCACAACTTTCCACCAAGCTCTACAAGATCCCAGAGTCAGGGGCCTGTATTTTCC
    TGGGTCTTTTGGGCTTTGCCGCTCCATTTACACAATGTGGTTATCCTGCATTAATGC
    ACTTCTTTCCTTCAGTACGAGATCTCCTAGATACCGCCTCAGCTCTATATCGGGAAG
    TCAAACAATCCAGATTGGGACTTCAACCCCATCAAGGACCACTGGCCACAAGCCAAC
    >ab033557
    CTCCACGACATTCCACCAAGCTCTGCTAGATCCCAGAGTGAGGGGCCTTTACTTTCC
    TGGGTCTTTTGGGCTTTGCTGCCCCTTTTACACAATGTGGCTATCCTGCCTTAATGC
    ACTTCTTTCCTTCCATTCGAGATCTTCTCGACACCGCCTCTGCTCTGTATCGGGAGG
    TCAAACAATCCAGATTGGGACTTCAACCCCAACAAGGATCAATGGCCAGAAGCAAAT
    >x97850
    CTCCACAACTTTCCTCCAAACTCTTCAAGATTCCAGAGTCAGGGCCCTGTACCTTCC
    TGGGTCTTTTGGGGTTTGCCGCCCCTTTCACGCAATGTGGATATCCTGCTTTAATGC
    ACTTTTTTCCTTCTATTCGAGATCTCCTCGACACCGCCTCTGCTCTGTATCGGGAGG
    TCAGAAAATCCAGATTGGGACCTCAACCCGCACAAGGACAACTGGCCGGACGCCAAC


      I like your solution, and ours are similar. You know more about the FASTA format than I do and were able to "decode" the OP's intent better than I could. I like it!

      I do admit confusion about this "// 10", though: I don't understand why it would be necessary, or what its purpose is.

      for ( @seqPosns[ 0 .. $ARGV[ 1 ] // 10 ] ) { ...blah....}

        It simply means that if I forget/omit the second parameter (the number of output records to produce), I get 10 by default.
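        A tiny illustration of the defined-or operator (available since perl 5.10); the @args array here just stands in for @ARGV:

        use strict;
        use warnings;

        my @args  = ('input.fa');      # pretend only one argument was given
        my $count = $args[1] // 10;    # $args[1] is undef, so $count becomes 10
        print "$count\n";              # prints 10

        # Unlike ||, // falls back only when the value is undef, so a real 0 survives:
        print 0 // 10, "\n";           # prints 0
        print 0 || 10, "\n";           # prints 10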



Re: Get random unique lines from file
by Marshall (Canon) on Aug 17, 2012 at 22:06 UTC
    The formatting of the code/data made your intent hard for me to understand, but maybe this works for you:
    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;            # a core utility
    use List::Util qw(shuffle);  # a core utility

    my @data;
    while (<DATA>)
    {
        next if /^\s*$/;   # skip blank lines
        my ($important) = (split(' ',$_))[0];
        push @data, $important;
    }
    print Dumper \@data;   # debug - just to verify the split()

    # so @data is about 100,000 entries
    # this is a big array, but not that big
    # the use of splice() works, but is relatively inefficient
    # perhaps just use List::Util shuffle to randomize the
    # indices?
    # This will create a list with 100,000 values
    # but it will do it efficiently.
    # Within the parameters of what you have said,
    # 1/4 million array/list elements will work ok, even on
    # my wimpy Win XP 32-bit machine

    foreach my $index (shuffle (0..@data-1)) {
        print "$index $data[$index]\n";
    }

    =prints
    $VAR1 = [
              'FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23',
              'FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23',
              'FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23',
              'FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23'
            ];
    3 FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23
    2 FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23
    1 FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23
    0 FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23
    =cut
    __DATA__
    FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23 ACTGCCAACGCGCAGAACCTTACCAGGTCCTGACTTCCTGACTATGGTTATTAGAAATAA TTTCCTTCAGTTCGGCTGGGTCAGTGACAGGTGATGCATGGCCGTC
    FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23 ACTGCCTAACCGATGAACCTTACCTACACTTGACATGCAGAGAACTTTCCAGAGATGGAT TGGTGCCTTCGGGAACTCTGACACAGGTGATGCATCGCCGTC
    FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23 ACTGCCTACGCGAAGAACCTTACCTACACTTGACATACAGAGAACTTACCAGAGATGGTT TGGTGCCTTCGGGAACTCTGATACAGGTGATGCATGGCTGTC
    FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23 ACTGCCAACGCGCAGAACCTTACCAACCCTTGACATCCAGAGAATTTTCTAGAGATAGAT TTGTGCCTTCGGGAACTCTGTGACAGGTGATGCATGGCTGTC
Re: Get random unique lines from file
by roboticus (Chancellor) on Aug 17, 2012 at 22:54 UTC

    radnorr:

    Some time ago, someone here at PM showed a pretty cool way to select a random line from a long file. Basically, as you read the file, you store the line if rand gives you a value lower than 1 / (line number). I tried to generalize it here:

    #!/usr/bin/perl
    #
    # sample_random_lines_from_file.pl <FName> <NumSamples>

    use 5.10.1;
    use strict;
    use warnings;
    use autodie;

    my @samples;
    my $FName = shift // die "missing: <filename> <numsamples>";
    my $num   = shift // die "missing: <numsamples>";

    open my $FH, '<', $FName;
    while (<$FH>) {
        if ($num/$. > rand) {
            my $i = @samples;
            if ($i > $num) { $i = rand @samples; }
            #print "slot $i, size=" . scalar(@samples) . ", line $.\n";
            $samples[$i]=[ $., $_ ];
        }
    }

    print "random samples:\n";
    print $$_[1] for sort { $$a[0] <=> $$b[0] } @samples;

    I haven't tested it extensively: it works, but I haven't yet convinced myself that it's free of bias. Anyway, the little testing I did was to first generate a file with a million lines in it and then run the script a few times:

    $ perl -e 'print "$_\n" for 1 .. 1000000' >a_million_lines
    marco@Boink:/Work/Tools/SQL/parser
    $ perl pm_sample_lines_from_file.pl a_million_lines 10
    random samples:
    29748
    135818
    143918
    164669
    216447
    245165
    267754
    404776
    419876
    487740
    893947
    marco@Boink:/Work/Tools/SQL/parser
    $ perl pm_sample_lines_from_file.pl a_million_lines 10
    random samples:
    163918
    434324
    435340
    534748
    596221
    611074
    677311
    682939
    719979
    842687
    998139

    There may be a "bias" in it, in that there may be a preference for one end or the other. I haven't played with it enough to determine whether it has a bias, nor figured out a way to correct it if it does. Anyway, the changes I made to adapt the algorithm are rather simple: instead of having a probability of 1/(line number) as the indicator of whether to keep a line, I use (desired num samples)/(line number) as the flag to store the line. Then I select a random slot in the @samples array to stuff the line into (after we gather enough samples to fill @samples).
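    For comparison, the classic textbook form of this idea is reservoir sampling ("Algorithm R"), which is provably unbiased: fill the reservoir with the first $num lines, then replace a uniformly chosen slot with probability $num/$.. A minimal sketch (not the code above, just the standard algorithm):

    use strict;
    use warnings;

    my $num = shift // 10;    # desired number of samples
    my @samples;

    while (<>) {
        if (@samples < $num) {
            push @samples, [ $., $_ ];     # fill the reservoir first
        }
        elsif (rand($.) < $num) {          # keep line $. with probability $num/$.
            $samples[ int rand @samples ] = [ $., $_ ];   # evict a random slot
        }
    }
    print $$_[1] for sort { $$a[0] <=> $$b[0] } @samples;

    In this version the reservoir holds exactly $num lines, and every line of the file ends up in the sample with equal probability.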

    I hope you find it useful.

    ...roboticus


      radnorr:

      I had a little more time on my hands, so I tested it a bit. It turns out that there's a slight bias for lines from the start of the file. I haven't done any of the math for it, so I don't know if it'll be easy to fix or not. (The bias is small enough that I'm thinking it might be something as simple as never overwriting the last entry in @samples, so an early line might get in and stick around until the end.)

      I don't really have the time or inclination to run it down at the moment, but if I were going to do so, I'd do a couple things:

      • Run my test a couple more times with different numbers of samples, to see if I could easily vary the slope of the bias.
      • Make the sample array have *3* slots per entry instead of 2. I'd use the third slot to hold the number of times the slot was overwritten and histogram that as well, just in case it's something like the fencepost error mentioned earlier.

      Anyway, to test it, I did 1000 runs of 25 samples against the million-line file generated in my previous post:

      $ for J in {0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}; do ./pm_sample_lines_from_file.pl a_million_lines 25 >t.$J; done

      While it was cooking, I quickly put together something to crunch the files and check:

      #!/usr/bin/perl
      #
      # pm_sample_histo_test.pl <FNames>
      #
      # Make a brief histogram to see if there's any obvious bias in the sampler.
      # Assumes you're sampling from a file containing a million lines, each of
      # which contains just the line number.
      #
      use 5.10.1;
      use strict;
      use warnings;

      my %H;
      for my $FName (@ARGV) {
          print "-- $FName --\n";
          open my $FH, '<', $FName or die "can't open $FName\n";
          while (<$FH>) {
              next if /^random/;
              my $I = int $_/10000;
              $H{$I}++;
          }
      }

      my $max=0;
      for (keys %H) { $max = $H{$_} if $H{$_} > $max; }

      for my $I (0 .. 99) {
          my $bar = substr('*' x int(50 * ($H{$I}/$max)) . ' 'x50, 0, 50);
          if ($I > 1 and $I < 98) {
              my $avg;
              my $sum = 0;
              $sum += $H{$_} for $I-2 .. $I+2;
              $avg = $sum/5;
              substr($bar,int(50 * ($avg/$max)),1)='+';
          }
          printf "% 3u: (% 3u) %s\n", $I, $H{$I} // 0, $bar;
      }

      It accumulates the samples from the million-line file into 100 buckets (lines 1..9999 land in the first bucket, 10000..19999 in the second, and so on). It then plots a histogram and overlays it with a 5-bucket moving average:

      $ ./pm_sample_histo_test.pl t.*
        0: (321) **************************************************
        1: (290) *********************************************
        2: (290) *********************************************+
        3: (291) *********************************************+
        4: (281) ******************************************* +
        5: (307) *********************************************+*
        6: (290) ********************************************+
        7: (283) ********************************************+
        8: (280) ******************************************* +
        9: (283) ********************************************+
       10: (309) ********************************************+***
       11: (287) *******************************************+
       12: (267) ***************************************** +
       13: (253) *************************************** +
       14: (292) *****************************************+***
       15: (257) **************************************** +
       16: (278) ******************************************+
       17: (254) *************************************** +
       18: (286) ******************************************+*
       19: (250) ************************************** +
       20: (291) ******************************************+**
       21: (277) ******************************************+
       22: (270) ******************************************+
       23: (262) **************************************** +
       24: (261) ****************************************+
       25: (263) ***************************************+
       26: (246) ************************************** +
       27: (237) ************************************ +
       28: (261) ***************************************+
       29: (245) ************************************** +
       30: (265) ***************************************+*
       31: (267) ***************************************+*
       32: (234) ************************************ +
       33: (269) ****************************************+
       34: (252) *************************************** +
       35: (272) ****************************************+*
       36: (259) ****************************************+
       37: (243) ************************************* +
       38: (266) ***************************************+*
       39: (237) ************************************ +
       40: (259) ***************************************+
       41: (256) ***************************************+
       42: (249) ************************************** +
       43: (269) ****************************************+
       44: (274) *****************************************+
       45: (266) ****************************************+
       46: (267) ***************************************+*
       47: (229) *********************************** +
       48: (233) ************************************ +
       49: (265) ***************************************+*
       50: (247) ************************************** +
       51: (281) ****************************************+**
       52: (243) ************************************* +
       53: (249) ************************************** +
       54: (249) ************************************** +
       55: (251) **************************************+
       56: (262) *************************************+**
       57: (228) *********************************** +
       58: (217) ********************************* +
       59: (262) *************************************+**
       60: (260) **************************************+*
       61: (225) *********************************** +
       62: (263) ****************************************+
       63: (297) ****************************************+*****
       64: (271) ******************************************+
       65: (250) ************************************** +
       66: (275) *****************************************+
       67: (267) ***************************************+*
       68: (260) ***************************************+
       69: (224) ********************************** +
       70: (256) **************************************+
       71: (241) ************************************* +
       72: (257) ***************************************+
       73: (256) ***************************************+
       74: (257) ***************************************+
       75: (246) ************************************** +
       76: (267) ***************************************+*
       77: (267) **************************************+**
       78: (232) ************************************ +
       79: (239) ************************************* +
       80: (252) **************************************+
       81: (269) **************************************+**
       82: (234) ************************************ +
       83: (239) ************************************* +
       84: (240) ************************************* +
       85: (259) ***************************************+
       86: (290) ****************************************+****
       87: (249) ************************************** +
       88: (256) ***************************************+
       89: (235) ************************************ +
       90: (242) ************************************* +
       91: (261) **************************************+*
       92: (245) **************************************+
       93: (245) ************************************** +
       94: (248) **************************************+
       95: (274) ***************************************+**
       96: (230) *********************************** +
       97: (263) **************************************+*
       98: (241) *************************************
       99: (236) ************************************

      As you can see, at the low end there are 290-ish lines selected per bucket, while at the high end there are only 250-ish.

      ...roboticus


      The problem with this is that it only works with individual lines, but FASTA files contain multi-line records, each consisting of one header line and a variable number of payload lines.



        BrowserUk:

        Ah, well, I know jack about FASTA files, so I didn't consider that. Of course, by changing the reader to accumulate records instead of lines, it could be adapted. Though since there are already a couple of working examples from you and Marshall, and since mine has a bias in it, there's no real reason to do so.

        I know that *you* know how to make the changes, but if someone tripping across this node in the future wants to do it, you can do something (untested!) like this:

        my $cnt_recs = 0;
        my @record;
        while (<$FH>) {
            if (/start of record marker/) {
                # The previous record (if any) is complete: decide whether to keep it
                if (@record and $num/$cnt_recs > rand) {
                    my $i = @samples;
                    if ($i > $num) { $i = rand @samples; }
                    $samples[$i] = [ $cnt_recs, [@record] ];
                }
                ++$cnt_recs;
                @record = ($_);     # start the new record with its header line
            }
            else {
                push @record, $_;   # accumulate the record body
            }
        }
        # The final record needs the same treatment once the file ends
        if (@record and $num/$cnt_recs > rand) {
            my $i = @samples;
            if ($i > $num) { $i = rand @samples; }
            $samples[$i] = [ $cnt_recs, [@record] ];
        }

        ...roboticus


Re: Get random unique lines from file
by Cristoforo (Curate) on Aug 17, 2012 at 21:25 UTC
    Look at the BioPerl package (in particular Bio::SeqIO and Bio::Seq).

    Bio::SeqIO provides a method for opening a FASTA file and Bio::Seq provides methods to access the particular info you're after.

    Chris

    Update: I just now realized that doesn't solve the problem of getting the random lines you want.  :-(
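    Combining Bio::SeqIO with List::Util's shuffle would get there, though. A minimal sketch, assuming BioPerl is installed and reusing the OP's filename and 4266-record sample size:

    use strict;
    use warnings;
    use Bio::SeqIO;                 # requires the BioPerl distribution
    use List::Util qw(shuffle);

    my $in = Bio::SeqIO->new( -file   => 'josh_vamps_seqs_ALL.fa',
                              -format => 'fasta' );

    my @ids;
    while ( my $seq = $in->next_seq ) {
        push @ids, $seq->display_id;    # the header text after '>'
    }

    print "$_\n" for ( shuffle @ids )[ 0 .. 4265 ];    # 4266 random IDs

    Here display_id returns the first whitespace-delimited word of the header, which matches the ID format the OP wants to print.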