Re: Faulty Control Structures?

I don't have a fix for you, but instead I'm offering some refactoring advice.

Load what you need from the annotation file into a hash.
Might as well do your processing one line at a time, no need to have the whole file in memory.

Here is some code to implement these first thoughts. I have not changed your use of the globals @window and @probe, which would be my next targets for refactoring. But I think that after you refactor, you may find it easier to debug.

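open my $annotation_read_handle, '<', $annotation_file
    or die "Cannot open $annotation_file: $!";
my %annotation_for;
while (my $ad = <$annotation_read_handle> ) {
    # remove the newline so it cannot end up inside the last field
    chomp $ad;
    # read $an_chrom, $prol, $pror out of $ad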
    my (
        $an_chrom, undef, undef, 
        $prol,     $pror, undef, 
        undef,     undef, $mess,
    ) = split(/\t/, $ad);
    # read $name out of $mess
    my (undef, undef, $name) = split(/\;/, $mess);
    
    # store for future lookups
    $annotation_for{$an_chrom} = [ $name, $prol, $pror ];
}

# loop through the main data file
OLC: 
    while (my $md = <$main_read_handle> ) {
    # remove newlines
    chomp $md; 
    # pull out chromosome #, window start, end
    my ($main_chrom, $winl, $winr) = split(/\t/, $md);
    # see if $main_chrom has been annotated
    next OLC if !exists $annotation_for{$main_chrom};
    
    my $array_ref = $annotation_for{$main_chrom};
    ( my $name, @probe ) = @$array_ref;
    # put the window start, end into array for further processing
    @window = ($winl, $winr);

    # call the range_finding sub to look for matches
    my $return = range_find();
    next OLC if !$return;
    # upon matching, print out the name of the gene along with the original values
    print OUTPUT "$name\t $md\n";
}
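
As for the globals: one way to start is to have the range check take its inputs as arguments rather than reading @window and @probe. I have not seen the body of range_find, so the following is only a sketch under the assumption that it tests whether the window overlaps the probe. The name range_overlaps and the coordinates are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for range_find: returns true when the closed
# intervals [$winl, $winr] and [$prol, $pror] share at least one position.
sub range_overlaps {
    my ($winl, $winr, $prol, $pror) = @_;
    return $winl <= $pror && $prol <= $winr;
}

# Made-up coordinates for illustration only.
print range_overlaps(100, 200, 150, 250) ? "match\n" : "no match\n";
print range_overlaps(100, 200, 300, 400) ? "match\n" : "no match\n";
```

With the arguments passed in, the loop body would call something like range_overlaps($winl, $winr, $prol, $pror), and the sub can be tested on its own without setting up any package variables first.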

Re^2: Faulty Control Structures? by bioinformatics (Friar) on Jan 28, 2008 at 23:22 UTC
Thanks for your input. Unfortunately, this won't work, as there isn't a good selection of unique identifiers to use as keys for the hash. So, using the code you provide, I'd end up with 23 key-value combinations when I need 330k :). A hash of arrays would work better, in that I could have the values appended to the arrays for each chromosome, but then getting the data out would be a bit of a nightmare. I will look into cleaning up the globals though, as I was being a bit lazy there :).

EDIT: Actually, 24 combinations, as there are both X and Y to consider :).

Bioinformatics
Re^3: Faulty Control Structures? by Narveson (Chaplain) on Jan 29, 2008 at 04:55 UTC
You're right. I overlooked the statement label in one of your `next` statements. I could not have arrived at my misreading if I had been as aware as you are that there are only two dozen chromosomes.

But what about your hash of arrays? Why would getting the data out be such a nightmare?

Populating the hash of arrays:

open my $annotation_read_handle, '<', $annotation_file;
my %annotations_for;
while (my $ad = <$annotation_read_handle> ) {
    # read $an_chrom out of $ad
    my ($an_chrom, undef) = split(/\t/, $ad);
    # store for future lookups
    push @{ $annotations_for{$an_chrom} }, $ad;
}
close $annotation_read_handle;

Now read through the main data file and assign each chromosome number to `my $main_chrom`.

# look up the list of annotations relevant to the current chromosome
my $annotations_ref = $annotations_for{$main_chrom};
# loop through just these annotations
ILC: foreach my $ad (@$annotations_ref) {
    # ...
}

Of course, as other more enlightened commentators have already pointed out, the most important thing to optimize is the range_find subroutine.
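
To show that getting the data back out need not be painful, here is a small self-contained sketch of the hash-of-arrays idea. Everything in it is an assumption made for illustration: the annotation lines are invented, the column positions follow the split in my first post, the overlap test stands in for whatever range_find really does, and genes_in_window is a hypothetical helper name.

```perl
use strict;
use warnings;

# Made-up annotation lines in the column layout assumed by the first post:
# chromosome, two unused fields, probe start, probe end, three unused
# fields, then a semicolon-separated field whose third part is the name.
my @annotation_lines = (
    "chr1\t.\t.\t100\t200\t.\t.\t.\ta;b;geneA\n",
    "chr1\t.\t.\t500\t600\t.\t.\t.\ta;b;geneB\n",
    "chrX\t.\t.\t150\t250\t.\t.\t.\ta;b;geneC\n",
);

# Build the hash of arrays: one list of [name, start, end] per chromosome.
my %annotations_for;
for my $ad (@annotation_lines) {
    chomp(my $line = $ad);
    my ($an_chrom, $prol, $pror, $mess) = (split /\t/, $line)[0, 3, 4, 8];
    my $name = (split /;/, $mess)[2];
    push @{ $annotations_for{$an_chrom} }, [ $name, $prol, $pror ];
}

# Return the names of all annotations on $chrom that overlap the window.
sub genes_in_window {
    my ($chrom, $winl, $winr) = @_;
    my $annotations_ref = $annotations_for{$chrom} or return;
    return map  { $_->[0] }
           grep { $winl <= $_->[2] && $_->[1] <= $winr } @$annotations_ref;
}

print join(',', genes_in_window('chr1', 150, 550)), "\n";
```

The last line prints geneA,geneB, since both chr1 probes overlap the window 150 to 550. Each lookup only walks the annotations for one chromosome, so the hash still has about two dozen keys while all 330k records remain reachable.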

