hash to hash comparison on large files

patric has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have two files. I have to compare the contents of specific columns between the two files and print the results. It sounds simple, but I am running into trouble when I code it. The files look like this (each file is 500MB in size). The number of lines in the two files is not equal:

sample file1:
ast578    61218    15755    +    C/A
ast579    61218    15856    -    A/G/T
ast580    61218    65798    +    T/C
ast581    61218    67955    -    A/TTA/AT
ast582    61218    68625    +    -/G/AT
ast599    61218    68025    +    (AT)12/34/32

sample file2:
61218    15755    A > T    0    4    0    0    2    7
61218    15856    T > C    3    2    0    3    1    8
61218    42547    A > G    3    7    5    4    1    6
61218    67955    A > G    0    10    0    9    3    4
61218    68625    G > A    0    10    0    8    1    5

Here, the 2nd, 3rd, and last columns of file1 are compared with the 1st, 2nd, and 3rd columns of file2. If the single letter before the '>' symbol in file2 matches a single letter between the '/' symbols in file1, the file2 line is reported as a hit and printed in the result. It is reported only if the match in file1 is a single letter; otherwise it is not. The 1st column of file1 is appended to each file2 hit line in the output. The output file looks like this:

61218    15755    A > T    0    4    0    0    2    7    ast578
61218    15856    T > C    3    2    0    3    1    8    ast579
61218    67955    A > G    0    10    0    9    3    4    ast581

the program which i have written so far is:

#!/usr/bin/perl
use warnings;
use strict;
open(FH,"file1.txt")or die "can not open file";
open(FH1,"file2.txt")or die "can not open file";
open(OUT,">result.out")or die "can not create file";

my @file1;
while(my $line1=<FH>){
my @list1=split("\t",$line1);
push(@file1,$list1[1]."#".$list1[2],$list1[4]);
}
my %hash1=@file1;

my @file2;
while(my $line2=<FH1>){
my @list2=split("\t",$line2);
push(@file2,$list2[0]."#".$list2[1],$list2[2]);
}
my %hash2=@file2;

my @allhits;
while(my ($key1,$value1)=each(%hash1)){
while(my ($key2,$value2)=each(%hash2)){
    if($key1 eq $key2){
        $value2=~s/\s//g;
        my @val1=split("/",$value1);
        my @val2=split(">",$value2);
        foreach(@val1){
            if($_ eq $val2[1]){
                print "$key1\n"
                #push(@allhits,$key1);
            }
        }
    }
}
}
=pod
foreach(@allhits){
    while(my $str=<FH1>){
        my($chrom,$position,$var,$one,$two,$three,$four,$five,$six)=split("\t",$str);
        my($id,$location)=split("#",$_);
        if(($chrom == $id) && ($position==$location)){
            print "$_";
        }
    }
}

The program is taking a very long time to run. Can anyone please suggest a simpler way? :( I would be really thankful if you can help with this :)


Replies are listed 'Best First'.
Re: hash to hash comparison on large files by moritz (Cardinal) on Apr 08, 2009 at 13:56 UTC
You're only iterating over hashes, not using the lookup features. If you do it properly, you'll increase the speed of the loop roughly by the number of elements in %hash2.
`while(my ($key1,$value1)=each(%hash1)){ while(my ($key2,$value2)=each(%hash2)){ if($key1 eq $key2){ ... } } }`
would be better (and faster!) written as
`while(my ($key1,$value1)=each(%hash1)){ if (exists $hash2{$key1}) { my $value2 = $hash2{$key1}; ... } }`
Also note that this code:
`foreach(@allhits){ while(my $str=<FH1>){`
exhausts the `<FH1>` iterator for the first value in `@allhits`, and does nothing for the subsequent values, which is probably not what you want. ~~Likewise I don't see how you ever get items into @file2.~~ (Update: I should have looked more carefully)
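To make the lookup idea above concrete, here is a minimal sketch. The helper name `common_keys` and the toy hashes are assumptions for illustration, not code from the thread; the keys follow the "chrom#position" format used in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the keys of %$h1 that
# also exist in %$h2, using one hash lookup per key instead of a nested scan.
sub common_keys {
    my ($h1, $h2) = @_;
    my @common = grep { exists $h2->{$_} } keys %$h1;
    return sort @common;
}

# Toy data in the "chrom#position" key format used in the question.
my %hash1 = ('61218#15755' => 'C/A',   '61218#65798' => 'T/C');
my %hash2 = ('61218#15755' => 'A > T', '61218#42547' => 'A > G');

print "$_\n" for common_keys(\%hash1, \%hash2);
```

With N keys in one hash and M in the other, this does roughly N lookups rather than N times M comparisons, which is where the speedup described above comes from.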
Re: hash to hash comparison on large files by targetsmart (Curate) on Apr 08, 2009 at 13:55 UTC
Instead of
`while(my ($key2,$value2)=each(%hash2)){ if($key1 eq $key2){`
try
`if(exists $hash2{$key1}){`
and you can save some time.
UPDATE: Is it a typing mistake,
`open(FH,"file1.txt")or die "can not open file"; open(FH1,"file1.txt")or die "can not open file";`
opening the same file twice?!
Since `#push(@allhits,$key1);` is commented out, `foreach(@allhits){` will iterate over nothing! Moreover, you would have to read file1 again to get the first column of file1 to append to the file2 lines, so (going by your method) better change
`push(@file1,$list1[1]."#".$list1[2],$list1[4]);`
to
`push(@file1,$list1[1]."#".$list1[2],[$list1[4],$list1[0]]);`
and in `my @val1=split("/",$value1);` use `my @val1=split("/",$value1->[0]);`
and in `#push(@allhits,$key1);` use `push(@allhits,[$key1,$value1->[1]]);` (correction from 0 to 1)
and in `my($id,$location)=split("#",$_);` use `my($id,$location)=split("#",$_->[0]);`
and in `print "$_";` use `print "@{$_}";` to print what you need to write.
(Untested)
I agree with moritz on `while(my $str=<FH1>){`, so use seek to move the file pointer back to the beginning after the while loop ends.
Vivek
-- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.
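As a rough illustration of the seek suggestion, the sketch below reads a filehandle to the end, rewinds it with `seek($fh, 0, 0)`, and reads it again. The helper name `count_lines_and_rewind` and the in-memory sample data are assumptions made so the example runs without any files on disk.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines left on a filehandle, then move the
# file pointer back to the start so the handle can be read again.
sub count_lines_and_rewind {
    my ($fh) = @_;
    my $n = 0;
    while (defined(my $line = <$fh>)) {
        $n++;
    }
    seek($fh, 0, 0) or die "cannot rewind: $!";
    return $n;
}

# Two sample rows from file2, held in memory for the demonstration.
my $data = "61218\t15755\tA > T\n61218\t15856\tT > C\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

my $first  = count_lines_and_rewind($fh);
my $second = count_lines_and_rewind($fh);
print "first pass: $first, second pass: $second\n";
```

Without the `seek`, the second pass would see zero lines. Note that rereading a 500MB file once per hit is likely to stay slow even with the rewind, so a single pass with a hash lookup is probably the better fit here.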
Re^2: hash to hash comparison on large files by patric (Acolyte) on Apr 08, 2009 at 17:11 UTC
So sorry... it's file1 for the FH filehandle and file2 for the FH1 filehandle. I changed the original filenames to the sample filenames that I had given in the example; that's why :)
Re: hash to hash comparison on large files by ig (Vicar) on Apr 08, 2009 at 16:34 UTC
You might try something like the following.

use strict;
use warnings;
use DB_File;

my $hashfile = "hash.$$";
tie my %hash1, "DB_File", $hashfile, O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "cannot open file $hashfile: $!";

open(my $fh1, "<", "file1.txt") or die "file1.txt: $!";
foreach (<$fh1>) {
    chomp;
    my @parts = split(/\s+/);
    $hash1{"$parts[1]#$parts[2]"} = $parts[4];
}
close($fh1);

open(my $fh2, "<", "file2.txt") or die "file2.txt: $!";
foreach (<$fh2>) {
    my @parts = split(/[\s>]+/);
    my $value = $hash1{"$parts[0]#$parts[1]"};
    if( defined($value) and grep { $_ eq $parts[2] } split(/\//, $value)) {
        print "$_";
    }
}
close($fh2);

untie %hash1;
unlink($hashfile);

Using a tied hash will allow you to process larger data sets - where an in-memory hash would exceed available memory. You may not need this but a 500MB file will result in quite a large hash. Performance will be better if you can use an in-memory hash (i.e. if you don't use the tied hash).

Only one hash is built. There is no benefit building a hash from the second file as you never do lookup in that hash.

Some of your matching criteria weren't clear to me. Your descriptions and code seemed to differ and I wasn't sure what "single character" means with respect to a value like "A/TTA/AT". You may have to make some changes to the matching criteria in the second loop.
Re^2: hash to hash comparison on large files by patric (Acolyte) on Apr 08, 2009 at 17:26 UTC
Thanks for your suggestions. Actually, what I meant by single character is this: when A/TTA/AT is split, only "A" is considered eligible to match against the single-letter values in file2. TTA and AT should not be matched against the file2 single letters, as they are 2 or 3 letters together, NOT single.
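Putting the clarified rule together with the single-hash approach, one possible sketch follows. It assumes an in-memory hash is affordable, that each chrom/position pair occurs at most once in file1, and that the letter to compare is the one before '>' (as in the description, not the one after it as in the posted code). The helper names `index_file1` and `find_hits` are hypothetical, and the demo uses a subset of the sample rows held in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: index file1 by "chrom#position". Each entry keeps the
# file1 ID and only the single-letter alleles ("A/TTA/AT" keeps just "A").
sub index_file1 {
    my ($fh) = @_;
    my %index;
    while (my $line = <$fh>) {
        my ($id, $chrom, $pos, undef, $alleles) = split ' ', $line;
        next unless defined $alleles;    # skip blank or short lines
        my @single = grep { /^[A-Za-z]$/ } split m{/}, $alleles;
        $index{"$chrom#$pos"} = { id => $id, single => \@single };
    }
    return \%index;
}

# Hypothetical helper: return the file2 lines whose letter before '>' is one
# of the single-letter alleles at the same position, with the ID appended.
sub find_hits {
    my ($index, $fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        # Capture chrom, position, and the single letter before '>'.
        my ($chrom, $pos, $from) =
            $line =~ /^\s*(\S+)\s+(\S+)\s+([A-Za-z])\s*>/ or next;
        my $entry = $index->{"$chrom#$pos"} or next;
        push @hits, "$line\t$entry->{id}"
            if grep { $_ eq $from } @{ $entry->{single} };
    }
    return @hits;
}

# Demo on a subset of the sample rows, held in memory instead of on disk.
my $file1 = join '', map { join("\t", @$_) . "\n" } (
    [qw(ast578 61218 15755 + C/A)],
    [qw(ast579 61218 15856 - A/G/T)],
    [qw(ast581 61218 67955 - A/TTA/AT)],
);
my $file2 = join '', map { join("\t", @$_) . "\n" } (
    ['61218', '15755', 'A > T', qw(0 4 0 0 2 7)],
    ['61218', '15856', 'T > C', qw(3 2 0 3 1 8)],
    ['61218', '42547', 'A > G', qw(3 7 5 4 1 6)],
    ['61218', '67955', 'A > G', qw(0 10 0 9 3 4)],
);
open(my $fh1, '<', \$file1) or die "cannot open in-memory file1: $!";
open(my $fh2, '<', \$file2) or die "cannot open in-memory file2: $!";

print "$_\n" for find_hits(index_file1($fh1), $fh2);
```

For real data, the two in-memory handles would be replaced by `open` calls on file1.txt and file2.txt. By the rule as stated, the 68625 row (`-/G/AT` against `G > A`) would also count as a hit even though the sample output omits it, so the filter may need adjusting if that omission is intentional.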

