http://qs321.pair.com?node_id=988096

radnorr has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have hit a snag with my Perl code. I am trying to take a random sample, without repeats, from a file containing more than 100,000 sequences. With the data laid out the way it is, I can't figure out how to write code that does this properly.

What I have so far:

#!/user/bin/perl -w
# Use your own path
use strict; # always use strict

my $file = "josh_vamps_seqs_ALL.fa";
open (FILE, $file) or die "$0: $!"; # printing errors is a Good Thing(tm)
my @file = <FILE>; # Store all the lines in an array
close FILE;
open OUT, ">rarefaction.fa" or die "$0: $!"; # Caution: will overwrite any existing files
my $i;
for ($i = 0; $i<4266;$i++) { # Start sampling loop
    my $rand = int(rand($#file));
    # rand gives you a random number between 0 and the number of lines remaining in the array, note
    # that this number is by necessity dynamic, it will drop 1000, 999, 998, etc.
    my $sample = splice @file, $rand, 1; # Cut out the $rand-th line, offset 1 means just one line
    print OUT $sample;
} # end of for loop

What it gives me:

CTTTTCTTCGGACTACTTACAAGGTGTTGCATGGTCGTC
>FW4WBAJ01DVAX5.ICM_PML_Bv6.PML_43_2003_06_09
CGAGTCAACGCGCAGAACCTTACCAACACTTGACATGTTCGTCGCGACTCTAAGAGATTA
TCTCTATGCGCAACGCGAAAACCTTACCTGGCCTTGACATGCATCTCTAAGCGTGTGAAA
>FMS0R7002J2YH1.ICM_CAM_Bv6.CAM_0011_2000_03_26
TGGTGCCTTCGGGAACGCAGTGACAGGTGATGCATGG
AAACCCTCAGAGACTTCGGTTAATGACATGTTTACAGGTGATGCATGGCCGTCG
>E6SXMJY02I00IR.ICM_BMO_Bv6.BMO_0005_2007_09_22
TTCGGTTCGGCCGGACGAAACACAGGTGT
TAGTGCGACGCGAAGAACCTTACCAGGGCTTAAATGTAGTGGGACAGGTCTAGAGATAGA
GGGTGCCCTTCGGGGAATCTAGTGAGAGGTGTTGCATGGCCGTCG
GTGAGCAACGCGCAGAACCTTACCAACCCTTGACATCCTGTGCTACTACCAGAGATGGTA
TACATCTACGCGAAGAACCTTATCTACACTTGACATACAGAGAACTTACCAGAGATGGTT
TGGTGCCTTCGGGAATCTAGTGACAGGTGATGCATGGCTGTCG
CACACCAACGCGAAAAACCTTACCAACACTTGACATGTTCGTCGCGACTCTAAGAGATTA
TTCGGTTCGGCCGGACGAAACACAGGTGTTGCATGGCTGTC

What I actually want:

FW4WBAJ01DVAX5.ICM_PML_Bv6.PML_43_2003_06_09
FMS0R7002J2YH1.ICM_CAM_Bv6.CAM_0011_2000_03_26
E6SXMJY02I00IR.ICM_BMO_Bv6.BMO_0005_2007_09_22

What the input file looks like:

>FRZPY5Q02F00L9.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACCTTACCAGGTCCTGACTTCCTGACTATGGTTATTAGAAATAA
TTTCCTTCAGTTCGGCTGGGTCAGTGACAGGTGATGCATGGCCGTC
>FRZPY5Q02F00U8.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTAACCGATGAACCTTACCTACACTTGACATGCAGAGAACTTTCCAGAGATGGAT
TGGTGCCTTCGGGAACTCTGACACAGGTGATGCATCGCCGTC
>FRZPY5Q02F01NC.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCTACGCGAAGAACCTTACCTACACTTGACATACAGAGAACTTACCAGAGATGGTT
TGGTGCCTTCGGGAACTCTGATACAGGTGATGCATGGCTGTC
>FRZPY5Q02F023C.ICM_AWP_Bv6.AWP_0001_2007_08_23
ACTGCCAACGCGCAGAACCTTACCAACCCTTGACATCCAGAGAATTTTCTAGAGATAGAT
TTGTGCCTTCGGGAACTCTGTGACAGGTGATGCATGGCTGTC

I don't know whether there is a way to make Perl's random sampling recognize a specific part of the input, say the ">" that starts each header line, and then print that line or the lines that follow it. Any pointers would be greatly appreciated.
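
One way this could be approached, offered as a minimal sketch rather than a definitive answer: instead of sampling individual lines, group the file into records first (each ">" header plus the sequence lines up to the next header), then sample whole records without repeats. The sub names read_fasta_records, sample_records and main are hypothetical, chosen only for illustration. The sketch assumes the input is plain FASTA as shown above and that the whole file fits in memory. shuffle comes from the core List::Util module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Group FASTA lines into records: each record is a ">" header line
# plus every sequence line up to the next header.
sub read_fasta_records {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        if ($line =~ /^>/) {
            push @records, $line;      # a header starts a new record
        }
        elsif (@records) {
            $records[-1] .= $line;     # sequence line joins the current record
        }
    }
    return @records;
}

# Return $n distinct records in random order (all of them if fewer exist).
sub sample_records {
    my ($n, @records) = @_;
    my @shuffled = shuffle @records;
    $n = @shuffled if $n > @shuffled;
    return @shuffled[0 .. $n - 1];
}

# Hypothetical driver: read $in, write $n sampled records to $out.
sub main {
    my ($in, $out, $n) = @_;
    open my $in_fh, '<', $in or die "$0: cannot read $in: $!";
    my @records = read_fasta_records($in_fh);
    close $in_fh;

    open my $out_fh, '>', $out or die "$0: cannot write $out: $!";
    print {$out_fh} sample_records($n, @records);
    close $out_fh or die "$0: cannot close $out: $!";
}

# Runs only when called with three arguments, for example:
#   perl sample.pl josh_vamps_seqs_ALL.fa rarefaction.fa 4266
main(@ARGV) if @ARGV == 3;
```

Because every element of the sample is a whole record, a header can never be separated from its sequence, and shuffling then slicing guarantees that no record is picked twice. If only the ID lines are wanted, as in the desired output above, the first line of each sampled record could be printed instead of the full record.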