Dear Venerable Monks

A1 Transcendence has asked for the wisdom of the Perl Monks concerning the following question:

Enlightened Monks, I have recently taken up Perl programming and i am falling in love with it. I am learning new things every day and having a blast;).

Today i am writing in order to obtain some wisdom on my current topic. My task is to write a fairly simple Perl program that 1) Opens a large 1000 genome file. 2) Opens 75 different Files each with 50 IDentifiers 3) In a loop...Starting with File 1 Identifier 1 if (File 1 ID 1 matches 2nd column of the 1000 genome file) extract the whole line from the 1000 genome file and store it in an output that is labeled File1_out. Lastly increase the ID to the 2nd element of 50. Do this for all the identifier in File 1, then after 50th ID open File2 and do the same thing again. Basically do this for all of the 75 Files. I think i should only open each of the files one time and work with arrays. Below is my script that i have been working/struggling on for a couple of days. I know the task can be done in under 10 lines but i have over complicated my code and need some help. Below is what i have done so far. Please share some of your ideas oh wise monks.

 #!/usr/bin/perl

$CHR_NUM = $ARGV[0];

$file_location = "/home/ALL.chr$CHR_NUM.phase1_release_v3.20101123.snp
+s_indels_svs.genotypes.vcf.gz";

for $x (1..75) {
        open (FH, "/home/raj/HAPLOTYPES_JAN_2015/PROG_1_Approach_2/C22
+-$x")|| die "Cannot";
        while (<FH>){
                chomp $_;
                push (@{$x} , $_);#this loop opens the 75 files with i
+dentifier and stores in arrays
        }
}

open(IN, "zcat $file_location |") || die "can't open CHROMOSOME $CHR_N
+UM input file 1000G!";

$g=1;           $z=0; 

while (<IN>) {

        $choice = 1;
        $x=$g; 
        $y=$z; 
        chomp $_;
        $counter ++;
        LABEL:
        if ($choice == 1){ 
                if ($counter > 29) {
                #       print "$_\n";

                        @currentline = split (/\t/, $_);
                        #print "$currentline[1]\n";
                        for $r($x..75) { # $r will be used as the name
+s for files 1-75
                                for $t ($y..$#$r) {  
                                        if ($currentline[1] == $$r[$t]
+) { 
                                                open ($r,">>$r"); 
                                                print $r "$_\n"; 
                                                $z = $t++;
                                                $choice = 0;
                                                goto LABEL;
                                        }
                                }
                        }
                }
        }
        if ($z >= $#$x){   
                $g = $x++;
                $z = $y = 0;
                close $r;
                if ($x==76){# 
                        last;
                }
        }

}
[download]

I know this should be done using only several lines but my shortcut techniques are still not advanced. Any wisdom will be greatly appreciated.

Comment on Dear Venerable Monks Download Code

Replies are listed 'Best First'.
Re: Dear Venerable Monks by Laurent_R (Canon) on Sep 14, 2015 at 21:44 UTC
Perhaps an attempt at a possible rewrite of the beginning of your code, just to give an example of modern best practices: #!/usr/bin/perl use strict; use warnings; my $CHR_NUM = $ARGV[0]; my $file_location = "/home/ALL.chr$CHR_NUM.phase1_release_v3.20101123. +snps_indels_svs.genotypes.vcf.gz"; my @id_files; # you might find a better variable name for th +at array of arrays for my $x (1..75) { # $x is a poor name for a variable, perhaps $f +ile_nr would be better my $file_in = "/home/raj/HAPLOTYPES_JAN_2015/PROG_1_Approach_2 +/C22-$x"; open my $FH, $file_in or die "Cannot open $file_in $!"; # if o +ne fails, you might be happy to know which one of the 75 files while (my $line = <$FH>){ chomp $line; push @{$id_files[$x]}, $line; } close $FH; } [download] Besides adding `use strict;` and `use warnings;`, which in turns forces me to declare all variables (with the `my` keyword here), the main change is to use an array of arrays to store the data read in your 75 files. You'll end up with something looking like that: `0 ARRAY(0x600500ae8) 0 empty slot # $x starts as 1, so no element for subscript 0 1 ARRAY(0x600500c68) 0 'line1 file 1' 1 'line2 file 1' 2 ' ... etc.) 2 ARRAY(0x600500b30) 0 'line 1 file 2' 2 ' ... etc. ... etc. 75 ARRAY(0x6005.....) 0 'line 1 file 75' 2 ' ... etc.` [download] I hope this makes the structure clear, don't hesitate to ask if you have any difficulty using such an array of arrays. The basic idea, if you need to read line 2 of the array corresponding to $x is this: `print $id_files[$x][2];` [download] I am not going any further i your code because I don't understand your code, we would really need to see samples of the input data to understand what you do. Please use meaningful variable names, this will make your life much easier. One last comment. This "goto"! Besides the fact that using such `goto` is usually very much frown upon in general and has been frown upon for about 40 to 45 years (I have never felt I needed to use one in Perl, well, at least not this form of goto which tends to break of the tenets of structured programming), it seems to make little sense to set `$choice` to 0 and, immediately thereafter, to `goto`to a place where you test whether $choice is equal to 1. This really looks like a design defect. If you just need to exit all the nested loops, there are better ways to control that, using `last` or sometimes `next`, in this case probably with a label. I hope this helps.	[reply] [d/l] [select]
Re^2: Dear Venerable Monks by A1 Transcendence (Initiate) on Sep 14, 2015 at 22:01 UTC
Thank You Laurent and Everyone else. All of your suggestions are very helpful and have already improved my abilities. I will get samples of the data input and rewrite my code before i post again. TYVM	[reply]
Re^3: Dear Venerable Monks by poj (Abbot) on Sep 15, 2015 at 13:06 UTC
Are the IDs in the 75 files unique to each file or does the same ID appear in more than 1 file ? I assume the genotypes.vcf.gz file is several GB's, is that correct ? poj	[reply]
Re: Dear Venerable Monks by NetWallah (Canon) on Sep 14, 2015 at 20:07 UTC
As Anonymous Monk has indicated, you have seriously abused "$x" to the point where your code deteriorates to nonsense, and IMHO, unmaintainable, probably unworkable. I would suggest a complete overhaul, using strict and warnings, declaring meaningful variable names. To optimize memory consumpion, read the 1000-line file into memory, then read and process each 50-line file (1000 < 50 * 50). Please re-post when you have a new outline, and request review when you have something that compiles, but may have a specific problem (which I'm sure you will encounter). This is part of the learning process. Software efficiency halves every 18 months, thus compensating for Moore's Law.	[reply]
Re: Dear Venerable Monks by 2teez (Vicar) on Sep 14, 2015 at 19:46 UTC
Welcome to The Monastery A1 Transcendence, Before your program is wheeled into the theater, here is the first wisdom to live by using perl! Always use these two lines in your Perl Program `use warnings; use strict;` [download] Secondly, You might have to give a sample of your input data and what you are expecting. Please check this How do I post a question effectively? If you tell me, I'll forget. If you show me, I'll remember. if you involve me, I'll understand. --- Author unknown to me	[reply] [d/l]
Re: Dear Venerable Monks by biohisham (Priest) on Sep 15, 2015 at 08:53 UTC
Today i am writing in order to obtain some wisdom on my current topic. My task is to write a fairly simple Perl program that 1) Opens a large 1000 genome file. 2) Opens 75 different Files each with 50 IDentifiers 3) In a loop...Starting with File 1 Identifier 1 if (File 1 ID 1 matches 2nd column of the 1000 genome file) extract the whole line from the 1000 genome file and store it in an output that is labeled File1_out. Lastly increase the ID to the 2nd element of 50. Do this for all the identifier in File 1, then after 50th ID open File2 and do the same thing again. Basically do this for all of the 75 Files. I think i should only open each of the files one time and work with arrays. You have not included samples from these files, that would have significantly helped us visualize your idea. Is the second column in the 1000 genome file a checksum val or ? Or does this file have another custom format ? How about the other 75 files, what do they have? My suggestion is to utilize hashes since you're not concerned with the ordering of the values in the 1K genome file but rather whether or not these values have matches in the other files. Hashes facilitate quick searching. So for the 1K file, read the second column into a hash: #Untested code for lack of example input by the OP use strict; #enforces predeclaration of variables, better scoping. use warnings; #tells you of errors or violations in your code use Data::Dumper; #visualize your data structures my %hash; #declaring a global variable to hold the desired column open(my $fh, "<", "1k_genome_file.bas") or die("could not open file $! +\n"); while(my $line = <$fh>){ chomp $line; #split around a delimiter (a space, a tab, a comma...etc). my @array=split(/\t/,$line); #get the second column my $second_column=$array[1]; $hash{$second_column}=1; } print Dumper(\%hash) #see if the hash looks like what is expected. [download] In the line `$hash{$second_column}=1;`: entries in second column are used as hash keys, giving each entry a value of 1. Repeated entries will be overwritten, this way you avoid duplicated lines in the 1k file. Alternatively, if your second column is made up of unique entries, you can use the line number where the entry in the file occurred as your hash value (that will be helpful later when you want to access the lines to be extracted). Here is a list of relevant posts reading the file using the line number, Line number in a file and Best way to read line x from a file. Now that you have read the desired column into a hash, you can then iterate over the other files in the folder and for each file check if the hash keys match that line then extract them from the 1k file. The module File::Find is a friendly way to iterating over folders. As other monks have suggested, turning on strict and warnings is a good coding behavior Something or the other, a monk since 2009	[reply] [d/l] [select]
Re: Dear Venerable Monks by Anonymous Monk on Sep 14, 2015 at 19:58 UTC
Lines 7 through 11: interesting use of a symbolic reference to create arrays named @1, @2, and so on. Stopped reading from that point on. BTW: You didn't mention; does this code actually do what it is supposed to do? It would certainly benefit from a rewrite using modern Perl techniques (test open return values always, use lexical filehandles, use lexical variables, be strict/warnings compliant, don't use symbolic refs, and so on), but if it's already working.........	[reply]
Re: Dear Venerable Monks by A1 Transcendence (Initiate) on Sep 30, 2015 at 11:42 UTC
THANK YOU SO MUCH EVERYONE YOU GUYS ARE THE BEST :)	[reply]


We don't bite newbies here... much
	PerlMonks