http://qs321.pair.com?node_id=756335

patric has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have 2 files. I have to compare the contents of specific columns between the two files and print the results. It sounds simple, but I am getting into trouble when I code it. The files look like this (each file is 500MB in size). The number of lines in the two files is not equal:
sample file1:
ast578 61218 15755 + C/A
ast579 61218 15856 - A/G/T
ast580 61218 65798 + T/C
ast581 61218 67955 - A/TTA/AT
ast582 61218 68625 + -/G/AT
ast599 61218 68025 + (AT)12/34/32

sample file2:
61218 15755 A > T 0 4 0 0 2 7
61218 15856 T > C 3 2 0 3 1 8
61218 42547 A > G 3 7 5 4 1 6
61218 67955 A > G 0 10 0 9 3 4
61218 68625 G > A 0 10 0 8 1 5

Here, the 2nd, 3rd, and last columns of file1 are compared with the 1st, 2nd, and 3rd columns of file2. If the single letter before the '>' symbol in file2 matches a single letter between the '/' symbols in file1, the file2 line is reported as a hit and printed in the result. It is reported only if the match in file1 is a single letter, otherwise not. The 1st column of file1 is appended to each file2 hit line in the output. The output file looks like this:
61218 15755 A > T 0 4 0 0 2 7 ast578
61218 15856 T > C 3 2 0 3 1 8 ast579
61218 67955 A > G 0 10 0 9 3 4 ast581

the program which i have written so far is:
#!/usr/bin/perl
use warnings;
use strict;

open(FH,"file1.txt")or die "can not open file";
open(FH1,"file2.txt")or die "can not open file";
open(OUT,">result.out")or die "can not create file";

my @file1;
while(my $line1=<FH>){
    my @list1=split("\t",$line1);
    push(@file1,$list1[1]."#".$list1[2],$list1[4]);
}
my %hash1=@file1;

my @file2;
while(my $line2=<FH1>){
    my @list2=split("\t",$line2);
    push(@file2,$list2[0]."#".$list2[1],$list2[2]);
}
my %hash2=@file2;

my @allhits;
while(my ($key1,$value1)=each(%hash1)){
    while(my ($key2,$value2)=each(%hash2)){
        if($key1 eq $key2){
            $value2=~s/\s//g;
            my @val1=split("/",$value1);
            my @val2=split(">",$value2);
            foreach(@val1){
                if($_ eq $val2[1]){
                    print "$key1\n"
                    #push(@allhits,$key1);
                }
            }
        }
    }
}
=pod
foreach(@allhits){
    while(my $str=<FH1>){
        my($chrom,$position,$var,$one,$two,$three,$four,$five,$six)=split("\t",$str);
        my($id,$location)=split("#",$_);
        if(($chrom == $id) && ($position==$location)){
            print "$_";
        }
    }
}

The program is taking a very long time to run. Can anyone please suggest a simpler way? :( I would be really thankful for any help on this :)

Replies are listed 'Best First'.
Re: hash to hash comparison on large files
by moritz (Cardinal) on Apr 08, 2009 at 13:56 UTC
    You're only iterating over the hashes, not using their lookup feature. If you do it properly, you'll speed up the loop by a factor of roughly the number of elements in %hash2.
        while(my ($key1,$value1)=each(%hash1)){
            while(my ($key2,$value2)=each(%hash2)){
                if($key1 eq $key2){
                    ...
                }
            }
        }

    This would be better (and faster!) written as

        while(my ($key1,$value1)=each(%hash1)){
            if (exists $hash2{$key1} ) {
                my $value2 = $hash2{$key1};
                ...
            }
        }
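
To make the difference concrete, here is a minimal, self-contained sketch. The hash contents and the helper name `common_keys` are illustrative assumptions, not part of the original program; only the "chrom#position" key scheme comes from the question. It walks the first hash once and asks the second hash directly for each key, so no inner loop is needed.

```perl
use strict;
use warnings;

# Illustrative data only: keys follow the "chrom#position" scheme from the question.
my %hash1 = ('61218#15755' => 'C/A',   '61218#65798' => 'T/C');
my %hash2 = ('61218#15755' => 'A > T', '61218#42547' => 'A > G');

# Hypothetical helper: return the keys present in both hashes,
# using one hash lookup per key of the first hash.
sub common_keys {
    my ($h1, $h2) = @_;
    my @common = grep { exists $h2->{$_} } keys %$h1;
    @common = sort @common;
    return @common;
}

print "$_\n" for common_keys(\%hash1, \%hash2);
```

With the two small hashes above, the only shared key is 61218#15755. The work grows with the size of the first hash alone, rather than with the product of both sizes.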

    Also note that this code:

        foreach(@allhits){
            while(my $str=<FH1>){
    exhausts the <FH1> iterator for the first value in @allhits, and does nothing for the subsequent values - probably not what you want.

    Likewise I don't see how you ever get items into @file2. (Update: I should have looked more carefully)

Re: hash to hash comparison on large files
by targetsmart (Curate) on Apr 08, 2009 at 13:55 UTC
    Instead of
        while(my ($key2,$value2)=each(%hash2)){
            if($key1 eq $key2){
    try
        if(exists $hash2{$key1}){
    which will save some time.

    UPDATE
    Is this a typing mistake?

        open(FH,"file1.txt")or die "can not open file";
        open(FH1,"file1.txt")or die "can not open file";
    Opening the same file twice?!

    Since this line is commented out:

        #push(@allhits,$key1);
    the loop
        foreach(@allhits){
    will iterate over nothing!

    Moreover, as written you would have to read file1 again to get its first column for the output, so it is better to change (going by your method)

        push(@file1,$list1[1]."#".$list1[2],$list1[4]);
    to
        push(@file1,$list1[1]."#".$list1[2],[$list1[4],$list1[0]]);
    and in place of
        my @val1=split("/",$value1);
    use
        my @val1=split("/",$value1->[0]);
    and in place of
        #push(@allhits,$key1);
    use
        push(@allhits,[$key1,$value1->[1]]);#(correction from 0 to 1)
    and in place of
        my($id,$location)=split("#",$_);
    use
        my($id,$location)=split("#",$_->[0]);
    and in place of
        print "$_";
    use
        print "@{$_}"; # or whatever you need to write.
    (Untested)

    I agree with moritz on

        while(my $str=<FH1>){
    so use seek to move the file pointer back to the beginning after the while loop ends.
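
A minimal sketch of the rewind, assuming an in-memory filehandle in place of file2.txt so that it runs on its own. The helper name `count_lines_per_pass` and the sample data are made up for illustration. Without the `seek`, every pass after the first would read zero lines.

```perl
use strict;
use warnings;

# In-memory filehandle standing in for file2.txt (assumption, so the sketch is runnable).
my $data = "line one\nline two\n";
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";

# Hypothetical helper: read the handle to the end $passes times,
# rewinding after each pass, and return the line count of every pass.
sub count_lines_per_pass {
    my ($handle, $passes) = @_;
    my @counts;
    for my $pass (1 .. $passes) {
        my $n = 0;
        while (my $line = <$handle>) {
            $n++;
        }
        push @counts, $n;
        seek($handle, 0, 0) or die "seek failed: $!";   # back to the start of the file
    }
    return @counts;
}

my @counts = count_lines_per_pass($fh, 2);
print "pass counts: @counts\n";
```

Rewinding makes the nested loop correct, but it still rereads the whole file once per hit, so on 500MB inputs the hash lookup suggested above is likely to matter far more.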

    Vivek
    -- In accordance with the prarabdha of each, the One whose function it is to ordain makes each to act. What will not happen will never happen, whatever effort one may put forth. And what will happen will not fail to happen, however much one may seek to prevent it. This is certain. The part of wisdom therefore is to stay quiet.
      So sorry... it's file1 for the FH file handle and file2 for the FH1 file handle. I changed the original filenames to the sample file names that I had given as the example, that's why :)
Re: hash to hash comparison on large files
by ig (Vicar) on Apr 08, 2009 at 16:34 UTC

    You might try something like the following.

    use strict;
    use warnings;
    use DB_File;

    my $hashfile = "hash.$$";
    tie my %hash1, "DB_File", $hashfile, O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "cannot open file $hashfile: $!";

    # Index file1 by "chrom#position"; the value is the allele column.
    open(my $fh1, "<", "file1.txt") or die "file1.txt: $!";
    while (<$fh1>) {    # while reads line by line; foreach would load the whole file first
        chomp;
        my @parts = split(/\s+/);
        $hash1{"$parts[1]#$parts[2]"} = $parts[4];
    }
    close($fh1);

    # Stream file2 and print each line whose letter before '>' is among the file1 alleles.
    open(my $fh2, "<", "file2.txt") or die "file2.txt: $!";
    while (<$fh2>) {
        my @parts = split(/[\s>]+/);
        my $value = $hash1{"$parts[0]#$parts[1]"};
        if( defined($value) and grep { $_ eq $parts[2] } split(/\//, $value)) {
            print "$_";
        }
    }
    close($fh2);

    untie %hash1;
    unlink($hashfile);

    Using a tied hash will allow you to process larger data sets - where an in-memory hash would exceed available memory. You may not need this but a 500MB file will result in quite a large hash. Performance will be better if you can use an in-memory hash (i.e. if you don't use the tied hash).

    Only one hash is built. There is no benefit in building a hash from the second file, as you never do a lookup in that hash.

    Some of your matching criteria weren't clear to me. Your descriptions and code seemed to differ and I wasn't sure what "single character" means with respect to a value like "A/TTA/AT". You may have to make some changes to the matching criteria in the second loop.

      Thanks for your suggestions. Actually, what I meant by single character is this: when A/TTA/AT is split, only "A" is considered eligible to match against the single letters in file2. TTA and AT should not be matched against the file2 single letters, as they are 3 and 2 letters together, NOT single.
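
Putting the clarified rule together with the one-hash approach, a sketch might look like the following. It assumes whitespace-separated columns and an in-memory hash. The sub names (`matches_single_letter`, `index_file1`, `find_hits`) are made up for illustration, and in-memory filehandles stand in for file1.txt and file2.txt. The demo data is the first four lines of each sample file from the question.

```perl
use strict;
use warnings;

# True when $letter equals one of the single-letter alleles in a string like "A/TTA/AT".
# The regex guard spells out the "single letter only" rule from the clarification.
sub matches_single_letter {
    my ($alleles, $letter) = @_;
    my @matches = grep { /^[A-Za-z]$/ && $_ eq $letter } split m{/}, $alleles;
    return scalar @matches;
}

# Build a lookup from file1: "chrom#pos" => [alleles, id].
sub index_file1 {
    my ($fh) = @_;
    my %index;
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $chrom, $pos, undef, $alleles) = split ' ', $line;
        next unless defined $alleles;          # skip short or blank lines
        $index{"$chrom#$pos"} = [$alleles, $id];
    }
    return \%index;
}

# Stream file2 and return each hit line with the file1 id appended.
sub find_hits {
    my ($index, $fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        # chrom, position, and the single letter before '>'
        my ($chrom, $pos, $letter) = $line =~ /^(\S+)\s+(\S+)\s+([A-Za-z])\s*>/ or next;
        my $entry = $index->{"$chrom#$pos"} or next;
        push @hits, "$line $entry->[1]" if matches_single_letter($entry->[0], $letter);
    }
    return @hits;
}

my $file1 = <<'END';
ast578 61218 15755 + C/A
ast579 61218 15856 - A/G/T
ast580 61218 65798 + T/C
ast581 61218 67955 - A/TTA/AT
END
my $file2 = <<'END';
61218 15755 A > T 0 4 0 0 2 7
61218 15856 T > C 3 2 0 3 1 8
61218 42547 A > G 3 7 5 4 1 6
61218 67955 A > G 0 10 0 9 3 4
END

open(my $fh1, '<', \$file1) or die "file1: $!";
open(my $fh2, '<', \$file2) or die "file2: $!";
my @hits = find_hits(index_file1($fh1), $fh2);
print "$_\n" for @hits;
```

On this subset the sketch prints the three lines shown in the question's sample output. The 68625 line is left out of the demo data on purpose: G also appears as a single letter in -/G/AT, so this rule would report it, yet the sample output omits it. Whether that reflects an extra condition or an oversight is not clear from the thread.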