http://qs321.pair.com?node_id=664760

bioinformatics has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,
I rarely post these days, but I'm a bit stumped currently and need more pairs of eyes to examine my code to see where the problem lies. My code "works" to some degree, in that it gives me an output, but the program doesn't end. Like with anything I do, I really need to make sure that I'm getting the correct data.

Background: I have two files, one in excess of 300k lines long. I need to split the lines, compare the values, and look for matches so that I can pull the gene name from the annotation file and add it to the original values from the data file. Essentially, its a glorified V-lookup. The catch is that I'm not using pattern matching, but equality based on the "window" values, or numbers that tell me where the feature would be found on the genome.

My guess is that I have the flow of the operations mixed up somewhere, possibly in the main loop. If you have any suggestions, I would be grateful. I also realize that I could have gone about this using a binary search or something similar, but I figured that this would be the simplest and quickest way to get at the results. oops :)
Thanks so much!!!
#!/usr/bin/perl -w use strict; our (@window, @probe)=(); our @main = &open_file_main(); our @annot = &open_file_annot(); &outfile; # loop through the main data file OLC: foreach my $md (@main) { # remove newlines chomp $md; # pull out chromosome #, window start, end my ($main_chrom, $winl, $winr) = split(/\t/, $md); # put the window start, end into array for further processing @window = ($winl, $winr); # loop through the "annotation" file ILC: foreach my $ad (@annot) { # pull out the chromosome #, window start, end my ($an_chrom, undef, undef, $prol, $pror, undef, undef, undef +, $mess) = split(/\t/, $ad); # make sure the chromosomes match to save processor time, skip + if no match if ($main_chrom ne $an_chrom) {next ILC;} # get the gene name fromt the $mess variable my (undef, undef, $name) = split(/\;/, $mess); # load the window start sites for the individual probes @probe = ($prol, $pror); # call the range_finding sub to look for matches my $return = range_find(); if ($return eq 1) { # upon matching, print out the name of the gene along with + the original values # and print OUTPUT "$name\t $md\n"; next OLC; } else {next ILC;} } } close OUTPUT; exit; sub open_file_main { # open the file, pull in the data print " What is the name of the ChIPOTle file\?"; chomp (my $file = <STDIN>); open (FILE, $file) || die "Cannot open $!"; my @data = <FILE>; close FILE; return @data; } sub open_file_annot { # open the file, pull in the data print " What is the name of the annotation file\?"; chomp (my $file = <STDIN>); open (FILE, $file) || die "Cannot open $!"; my @data = <FILE>; close FILE; return @data; } sub outfile { # open the outputfile open (OUTPUT, ">output.txt")|| die "Cannot open $!"; } sub calc_range { # simple sub to create the full window from the start + and end sites my @peaks = @_; my @peak_range = ($peaks[0] .. $peaks[1]); return @peak_range; } sub range_find { # loop in loop to look for ANY overlap of the values; + will be a true/false return my @range1 = &calc_range(@window); my @range2 = &calc_range(@probe); my $test = pop @range2; my $test2 = pop @range1; OLC2: foreach my $value1 (@range1) { # look to see if the windows don't overlap if ($test lt $window[0]) {return 0;} # look to see if the windows don't overlap elsif ($test2 lt $probe[0]) {return 0;} ILC2: foreach my $value2 (@range2) { if ( $value1 eq $value2) {return 1;} else {next ILC2;} } } return 0; }
Main data file:
chr10 726178 726428 1.121753867297440e-012 1.1607063976283 +87e-015 6 4.81000 chr10 4922028 4922428 2.163402534569952e-012 7.06701849934 +5696e-014 9 4.43000 chr10 5126478 5127178 8.348017333255820e-013 8.24324954124 +5406e-021 15 4.42000 chr10 14649778 14650028 2.090598548899472e-013 6.000728326 +245207e-018 6 5.04000 chr10 14651328 14651428 2.915017653920210e-012 1.890051416 +361198e-014 3 4.20000
Annotation Data:
chr1 NGS primary_transcript 4224 7502 -1 - . I +D=00003; accession=BC063682; Name=FLJ25222; ncbi_gene_id=374666; syno +nyms=-; description=CXYorf1-related protein; url=http://www.ncbi.nlm. +nih.gov/entrez/query.fcgi?db%3Dnucleotide%26cmd%3Dsearch%26term%3DBC0 +63682 chr1 NGS transcription_start_site 7502 7502 -1 - +. Parent=00003; accession=BC063682; Name=FLJ25222; ncbi_gene_id=37 +4666; synonyms=-; description=CXYorf1-related protein chr1 NGS primary_transcript 4268 7438 -2 - . I +D=00005; accession=BC073913; Name=MGC52000; ncbi_gene_id=375260; syno +nyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; description=CXYo +rf1-related protein; url=http://www.ncbi.nlm.nih.gov/entrez/query.fcg +i?db%3Dnucleotide%26cmd%3Dsearch%26term%3DBC073913 chr1 NGS transcription_start_site 7438 7438 -2 - +. Parent=00005; accession=BC073913; Name=MGC52000; ncbi_gene_id=37 +5260; synonyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; descri +ption=CXYorf1-related protein chr1 NGS primary_transcript 4268 14754 -3 - . +ID=00004; accession=BC048328; Name=MGC52000; ncbi_gene_id=375260; syn +onyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; description=CXY +orf1-related protein; url=http://www.ncbi.nlm.nih.gov/entrez/query.fc +gi?db%3Dnucleotide%26cmd%3Dsearch%26term%3DBC048328 chr1 NGS transcription_start_site 14754 14754 -3 - + . Parent=00004; accession=BC048328; Name=MGC52000; ncbi_gene_id= +375260; synonyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; desc +ription=CXYorf1-related protein chr1 NGS primary_transcript 4268 19697 -4 - . +ID=00006; accession=BC110996; Name=MGC52000; ncbi_gene_id=375260; syn +onyms=CXYorf1|MGC104889|MGC111476|MGC117230|MGC90409; description=CXY +orf1-related protein; url=http://www.ncbi.nlm.nih.gov/entrez/query.fc +gi?db%3Dnucleotide%26cmd%3Dsearch%26term%3DBC110996
Bioinformatics

Replies are listed 'Best First'.
Re: Faulty Control Structures?
by Narveson (Chaplain) on Jan 28, 2008 at 20:50 UTC

    I don't have a fix for you, but instead I'm offering some refactoring advice.

    1. Load what you need from the annotation file into a hash.
    2. Might as well do your processing one line at a time, no need to have the whole file in memory.

    Here follows some code to implement these first thoughts. I have not changed your use of the global @window, @probe, which would be my next targets for refactoring. But I think after you refactor, you may find it easier to debug.

    open my $annotation_read_handle, '<', $annotation_file; my %annotation_for; while (my $ad = <$annotation_read_handle> ) { # read $an_chrom, $prol, $pror out of $ad my ( $an_chrom, undef, undef, $prol, $pror, undef, undef, undef, $mess, ) = split(/\t/, $ad); # read $name out of $mess my (undef, undef, $name) = split(/\;/, $mess); # store for future lookups $annotation_for{$an_chrom} = [ $name, $prol, $pror ]; } # loop through the main data file OLC: while (my $md = <$main_read_handle> ) { # remove newlines chomp $md; # pull out chromosome #, window start, end my ($main_chrom, $winl, $winr) = split(/\t/, $md); # see if $main_chrom has been annotated next OLC if !exists $annotation_for{$main_chrom}; my $array_ref = $annotation_for{$main_chrom}; ( my $name, @probe ) = @$array_ref; # put the window start, end into array for further processing @window = ($winl, $winr); # call the range_finding sub to look for matches my $return = range_find(); next OLC if !$return; # upon matching, print out the name of the gene along with the ori +ginal values print OUTPUT "$name\t $md\n"; }
      Thanks for your input. Unfortunately, this won't work as there isn't a good selection of unique identifiers to use as Keys for the Hash. So, using the code you provide, I'd end up with 23 key-value combinations, when I need 330k :). A hash of arrays would work better, in that I could have the values appended to the arrays for each chromosome, but then getting the data out would be a bit of a nightmare. I will look to cleaning up the globals though, as I was being a bit lazy there :).
      EDIT: Actually, 24 combinations, as there are both x and y to consider :).
      Bioinformatics

        You're right. I overlooked the statement label in one of your next statements. I could not have arrived at my misreading if I had been as aware as you are that there are only two dozen chromosomes.

        But what about your hash of arrays? Why would getting the data out be such a nightmare?

        Populating the hash of arrays:

        open my $annotation_read_handle, '<', $annotation_file; my %annotations_for; while (my $ad = <$annotation_read_handle> ) { # read $an_chrom out of $ad my ($an_chrom, undef) = split(/\t/, $ad); # store for future lookups push @$annotations_for{$an_chrom}, $ad; } close $annotation_read_handle;

        Now read through the main data file and assign each chromosome number to my $main_chrom.

        # look up the list of annotations relevant to the current chromoso +me my $annotations_ref = $annotations_for{$main_chrom}; # loop through just these annotations ILC: foreach my $ad (@$annotations_ref) { # ... }

        Of course—as other more enlightened commentators have already pointed out—the most important thing to optimize is the range_find subroutine.

Re: Faulty Control Structures?
by Errto (Vicar) on Jan 28, 2008 at 20:33 UTC

    I tried running the program on the data you provided but it exited instantly and printed no output in output.txt so I assume that data is not representative. I don't exactly understand what the code is doing but I don't see anything that could cause an infinite loop. I see two things that make me suspicious: 1) your range data is all numbers but you're using the character-operators lt and eq to compare them, and 2) these lines:

    # look to see if the windows don't overlap if ($test lt $window[0]) {return 0;} # look to see if the windows don't overlap elsif ($test2 lt $probe[0]) {return 0;}
    are inside a loop in which the iterator variable is $value1, yet they don't consider that value.

    Update: Second thought - the second part of the code seems, if I understand it, to be designed to take two ranges of numbers and see if they overlap. To do so, it actually expands the ranges into lists and compares them one-by-one. This becomes, in the worst case, an N^2 calculation for what should be a constant operation:

    sub range_overlap { my ($start1, $end1, $start2, $end2); return ($end1 >= $start2 || $start1 <= $end2) && ($end2 >= $start1 || $start2 <= $end1); }
    untested but it should be about right.

      Mmmm... The data won't match up your right. I can adjust the window values and chromosome numbers to match up so that you do get output however. I just threw up data to illustrate.
      Bioinformatics
Re: Faulty Control Structures?
by BrowserUk (Patriarch) on Jan 29, 2008 at 06:49 UTC

    A few ways to optimise this:

    1. You mentioned somewhere that hashes are not useful as there are only 24 keys. You also mentioned about using a hash of arrays, but dismissed it.

      That's premature. If you load your main file into a hash of arrays, then you only need compare each of your annotated records against 1/24th of the main records. And that reduces your overall workload and runtime to 1/24th of what it would be.

      my %main; my $mfile = prompt( "Main file?" ); open MAIN, '<', $mfile or die "$mfile: $!"; m[^([^\t]+)\t] and push @{ $main{ $1 } }, $_ or die "Bad record in $mfile @ $." while <MAIN>; close MAIN; chomp @$_ for values %main;
    2. There is no reason to load the second file into ram. You can process it one record at a time:
      my $afile = prompt( "Annotation file?" ); open ANNOT, '<', $afile or die "$afile: $!"; open OUTPUT, '>', 'output.txt' or die $!;

      Now your main loops become:

      OUTER: while( <ANNOT> ) { chomp; my( $atag, $astart, $aend, $mess ) = split "\t"; next unless exists $main{ $atag }; for my $md ( @{ $main{ $atag } } ) { my( $mtag, $mstart, $mend ) = split "\t", $md; if( ?OVERLAPPED? ) { print OUTPUT "$atag\t$mess"; } } }
    3. As others have pointed out, your range checking is far too complex and computationally expensive. Although there are many (7?) permutations of relative positioning of the starts and ends of the ranges, there are only 2 that do not overlap:
      1. Main end less than Annot start:
        Ms------Me As------Ae
      2. Annot end less than Main start:
        Ms------Me As------Ae

      Any other position is overlapped.

      If both your files are already in sorted order (if they are not, (learn to) use your system sort utility and sort them ascending by the tag, start, end), then when Main Start moves greater than Annot End, then you need process no further main records for this tag, so we can skip to the next Annot record. (Outer loop)

      Equally, whilst Main End is less than Annot Start, you can skip to the next Main record. (Inner loop).

      And actually, that covers both (all) conditions when we do not want to print output, so no further range testing is required.

      That makes the loops look like this:

      OUTER: while( <ANNOT> ) { chomp; my( $atag, $astart, $aend, $mess ) = split "\t"; next unless exists $main{ $atag }; for my $md ( @{ $main{ $atag } } ) { my( $mtag, $mstart, $mend ) = split "\t", $md; next if $mend < $astart; next OUTER if $mstart > $aend; print OUTPUT "$atag\t$mess"; } }
    4. Finally, there is one further optimisation possible. As both files are sorted, when we start processing the second Annot record for a given tag, then we know it cannot be earlier than the last Annot record for this tag.

      Therefore, if we rejected and skipped the first N Main records for this tag for the last Annot record, we know we can skip at least that many for this record. This complicates the code a little, but yields a huge saving in processing which make it worth while.

    The final cut of the loops code looks like this:

    my( $first, $lastTag ) = ( 0, '' ); OUTER: while( <ANNOT> ) { chomp; ## ((simplified for testing) my( $atag, $astart, $aend, $mess ) = split "\t"; ## If the tag changed, reset the start position for the Main recor +ds ## and remember the new tag $first = 0, $lastTag = $atag if $atag ne $lastTag; ## Skip completely if this Annot tag has no corresponding Main tag next unless exists $main{ $atag }; ## For each Main record with this tag, ## But skipping those we rejected last time for my $md ( @{ $main{ $atag } }[ $first .. $#{ $main{ $atag } } ] + ) { ## (simplified for testing) my( $mtag, $mstart, $mend ) = split "\t", $md; ## Increment the skip and goto the next Main record ## If the end of this Main record is less than ## the start of the Annot record ++$first, next if $mend < $astart; ## Skip the rest of these Main record ## if the start of this Main record is greater than ## the end of this Annot record next OUTER if $mstart > $aend; ## We have an overlap, so output the info print OUTPUT "$atag\t$mess"; } }

    The result is that it now takes less that six minutes to process 300,000 record against 300,000 records, finding 113,000 overlaps:

    C:\test>664760-gen > 664760.main C:\test>wc -l 664760.main 300000 664760.main C:\test>664760-gen > 664760.annot C:\test>wc -l 664760.annot 300000 664760.annot C:\test>664760-b Main file?664760.main Annotation file?664760.annot Tue Jan 29 06:36:34 2008 Tue Jan 29 06:42:12 2008 C:\test>wc -l output.txt 112805 output.txt

    My best guess is this 3 if not 4 orders of magnitude quicker. Let's see anyone come close to that using an RDBMS.

    Full test code:


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Faulty Control Structures?
by dragonchild (Archbishop) on Jan 28, 2008 at 22:46 UTC
    My first thing to check would be the fact that you have nested for-loops that may just be taking a very very very long time to run. If you have small files providing the data for @main and @annot, the for-loops will run quickly. If you have large files for both, then the runtime of the innermost loop will be the main determiner of the runtime of the script. Given that you then have 2 more nested for-loops in range_find(), your runtime is going to be on the order of O(n^4). This is usually considered to be poor.

    The biggest suggestion I would have (after the obvious algorithmic improvement) is to pre-digest your data so that repetitive checks can be sped up or eliminated. The next thing would be to look at using a relational database (like MySQL or Oracle) and letting the power of relational calculus solve your problems.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      Your right. I tried to provide checkpoints so that I wouldn't have to crawl through all the data (aka, making sure the chromosomes considered match, etc.), but even then there is a lot to go through. Heck, I even tried to write a multi threaded version of this program, but it has a few bugs still. Should it really take this long though? I don't have any benchmarks written in, so that will probably be my next step. I'll look into putting this into a database to do some of this for me, but sql is still a bit of a learning curve for me.
      Bioinformatics
Re: Faulty Control Structures?
by apl (Monsignor) on Jan 28, 2008 at 20:31 UTC
    but the program doesn't end

    Do you mean the program doesn't stop even when you hit return in response to " What is the name of the ChIPOTle file\?"?

    Do you mean the program never finds the end of the file it's processing?

    My code "works" to some degree, in that it gives me an output

    Under what circumstance does it work? For the first file only? For more than one file?

      Good questions. The program works on small files, i.e. smaller annotation files of just a few lines that I used to test. They are the same format as the full files. When using the full files however, the program would not exit, even after I started it just prior to leaving on Friday, to come back to it early this morning (didn't think it would take that long :p). I did get information printed in the output file however. The program can find the end of the files, and is designed to annotate just one file at a time.
      Bioinformatics
Re: Faulty Control Structures?
by GrandFather (Saint) on Jan 29, 2008 at 05:00 UTC

    Seems to me your main block should look more like:

    #!/usr/bin/perl -w use strict; my @main = &open_file_main (); my @annot = &open_file_annot (); open OUTPUT, '>', 'output.txt' or die "Cannot open $!"; # loop through the main data file foreach my $md (@main) { chomp $md; my ($main_chrom, $winl, $winr) = split (/\s{2,}/, $md); foreach my $ad (@annot) { my ($an_chrom, undef, undef, $prol, $pror, undef, undef, undef +, $mess) = split (/\s{2,}/, $ad); next unless $main_chrom eq $an_chrom; my (undef, undef, $name) = split (/\;/, $mess); next if $winr < $prol or $winl > $pror; print OUTPUT "$name\t $md\n"; #last; } } exit;

    Note that there is no need for a sub to open the output file! If you need to do complicated stuff in the context of opening the output file, return the file handle from the sub - do not use a 'global' variable that is set as a side effect!


    Perl is environmentally friendly - it saves trees