

in reply to Dear Venerable Monks

Today I am writing in order to obtain some wisdom on my current topic. My task is to write a fairly simple Perl program that 1) opens a large 1000 genome file, 2) opens 75 different files, each with 50 identifiers, and 3) in a loop, starting with File 1, Identifier 1: if File 1 ID 1 matches the 2nd column of the 1000 genome file, extract the whole line from the 1000 genome file and store it in an output labeled File1_out, then move on to the 2nd of the 50 IDs. Do this for all the identifiers in File 1; after the 50th ID, open File 2 and do the same thing again, and so on for all 75 files. I think I should only open each of the files once and work with arrays.

You have not included samples from these files; that would have significantly helped us visualize your idea. Is the second column in the 1000 genome file a checksum value, or something else? Or does this file have another custom format? And what do the other 75 files contain?

My suggestion is to utilize hashes, since you're not concerned with the ordering of the values in the 1k genome file but rather with whether or not these values have matches in the other files. Hashes facilitate quick lookups. So for the 1k file, read the second column into a hash:

    #Untested code, for lack of example input by the OP
    use strict;       # enforces predeclaration of variables, better scoping
    use warnings;     # tells you of errors or violations in your code
    use Data::Dumper; # visualize your data structures

    my %hash; # declaring a variable to hold the desired column

    open(my $fh, "<", "1k_genome_file.bas")
        or die("could not open file: $!\n");
    while (my $line = <$fh>) {
        chomp $line;
        # split around a delimiter (a space, a tab, a comma, etc.)
        my @array = split(/\t/, $line);
        # get the second column
        my $second_column = $array[1];
        $hash{$second_column} = 1;
    }
    close $fh;
    print Dumper(\%hash); # see if the hash looks like what is expected

In the line $hash{$second_column}=1;, the entries in the second column are used as hash keys, and each key is given the value 1. Repeated entries simply overwrite one another, so duplicated lines in the 1k file are collapsed. Alternatively, if your second column is made up of unique entries, you can store the line number where each entry occurred as the hash value (that will be helpful later when you want to access the lines to be extracted), as in the variant below. Here are some relevant posts on reading a file by line number: Line number in a file and Best way to read line x from a file.
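
For example (untested, and only appropriate if the second column really is unique), the loop above could record Perl's built-in input line counter $. instead of a constant 1:

    while (my $line = <$fh>) {
        chomp $line;
        my @array = split(/\t/, $line);
        $hash{$array[1]} = $.; # $. is the current input line number for $fh
    }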

Now that you have read the desired column into a hash, you can iterate over the other files in the folder and, for each identifier, check whether it exists as a hash key; if it does, extract the corresponding line from the 1k file. The module File::Find is a friendly way to iterate over folders.
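
To make that step concrete, here is an untested sketch of the matching loop. It assumes the 75 files are named File1.txt through File75.txt (substitute your real names, or collect them with File::Find), that each holds one identifier per line, and that the first loop stored the full line rather than 1 (i.e. $hash{$second_column} = $line;):

    # Sketch only: the file names and one-ID-per-line layout are assumptions.
    for my $n (1 .. 75) {
        my $id_file = "File$n.txt";
        open(my $in, "<", $id_file)
            or die("could not open $id_file: $!\n");
        open(my $out, ">", "File${n}_out")
            or die("could not open File${n}_out: $!\n");
        while (my $id = <$in>) {
            chomp $id;
            # print the stored 1k-file line when the ID is a hash key
            print {$out} "$hash{$id}\n" if exists $hash{$id};
        }
        close $in;
        close $out;
    }

Because each lookup is a hash access, every file is read exactly once and the 1k file is never re-scanned, which matches your goal of opening each file a single time.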

As other monks have suggested, turning on strict and warnings is good coding practice.


Something or the other, a monk since 2009