PerlMonks  

how to compare two hashes with perl?

by FluffyBunny (Acolyte)
on Nov 04, 2009 at 20:07 UTC

FluffyBunny has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl monks,

I started teaching myself Perl because of my job (I'm a biochemistry student who needs to handle huge text files, 3 GB each). I need to compare two files that are in the same format. I'm still confused by hashes, and the code I wrote is inefficient. What I want is to take the first column (ID) and the 5th column (sequence) from each file and compare them:

1)Check ID names.
2)If they don't match, print both ID and the sequences from each file.
3)If they match, and the sequences match, do not print.
4)If they match, but the sequences do not match, print both ID and the sequences from each file.

This is what I'm trying to do. I do not get any syntax errors, but I get weird output. I'd very much appreciate it if you could help me out ASAP! Thank you very much =)
#!/usr/bin/perl
use warnings;
use strict;

my %bow1 = ();
my $file1 = shift;
open (FILE1, "$file1"); # Open first file
while (<FILE1>) {
    my ($ID1, undef, undef, undef, $Seq1) = split;
    $bow1{$ID1}[0] = $ID1;
    $bow1{$ID1}[1] = $Seq1;
}
close FILE1;

my %bow2 = ();
my $file2 = shift;
open (FILE2, "$file2"); # Open second file
while (<FILE2>) {
    my ($ID2, undef, undef, undef, $Seq2) = split;
    $bow2{$ID2}[0] = $ID2;
    $bow2{$ID2}[1] = $Seq2;
}
close FILE2;

for my $ID1 (keys %bow1) {
    for my $ID2 (keys %bow2) {
        if (defined $bow1{$ID1}[0] and defined $bow2{$ID2}[0]) {
            if ($bow1{$ID1}[0] eq $bow2{$ID2}[0]) {
                if ($bow1{$ID1}[1] =! $bow2{$ID2}[1] ) {
                    print "$bow1{$ID1}[0]\t$bow1{$ID1}[1]\t$bow2{$ID2}[0]\t$bow2{$ID2}[1]\n";
                }
            }
            if ($bow1{$ID1}[0] ne $bow2{$ID2}[0]) {
                print "\nIDs do not match\n";
                print "$bow1{$ID1}[0]\t$bow1{$ID1}[1]\t$bow2{$ID2}[0]\t$bow2{$ID2}[1]\n";
            }
        }
    }
}
exit;
UPDATE: My current output looks like this:
IDs do not match
HWUSI-EAS548:7:120:1791:800#0/1 HWUSI-EAS548:7:1:5:1566#0/1 CCACTGTGCTCCAGACTGCGTGACAGAGTGAGACTC
IDs do not match
HWUSI-EAS548:7:120:1791:800#0/1 HWUSI-EAS548:7:1:5:893#0/1 TTTGATGATTTCATTTGATTCCATTCGTTAATGATT
IDs do not match
HWUSI-EAS548:7:120:1791:800#0/1 HWUSI-EAS548:7:1:5:1527#0/1 CGGAGCTTGCAGTGAGCCGAGATCGCGCTACTGCAC
IDs do not match
HWUSI-EAS548:7:120:1791:800#0/1 HWUSI-EAS548:7:1:4:331#0/1 ANTGAGATACCATCTCACGCCAGTCAGACTGGCAAT
IDs do not match
HWUSI-EAS548:7:120:1791:800#0/1 HWUSI-EAS548:7:1:5:1098#0/1 CTTTGCATATTTTGGATAGACACCCAGAAGTGGAAT
IDs do not match
HWUSI-EAS548:7:120:1791:800#0/1 HWUSI-EAS548:7:1:5:877#0/1 AGGCCAGCAGATCACCTGAGGTTGGGAGTTCGAGAC
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 AATTAGCCAGGTGTGGTGGCGCATGCCTGTAATCCC HWUSI-EAS548:7:120:1791:900#0/1 ATTCCATTCCATTCCATTCCATTCCATTCCGTTCNG
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 AATTAGCCAGGTGTGGTGGCGCATGCCTGTAATCCC HWUSI-EAS548:7:120:1791:800#0/1 TTTAAAAAAAAAAAAAAAAAAAAAAAAATAATTTNT
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 HWUSI-EAS548:7:1:5:377#0/1 TTCCTTTCAATCATTCCCTTTGATTCCATTCAAAGG
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 HWUSI-EAS548:7:1:5:1530#0/1 TTCCTGTCGCGTCCATTCCATTCCATTTCACTCCAT
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 HWUSI-EAS548:7:1:4:12#0/1 CCCTGCGACTTGATNCCCTTAGCTGCTGAAGGACNC
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 HWUSI-EAS548:7:1:5:1566#0/1 CCACTGTGCTCCAGACTGCGTGACAGAGTGAGACTC
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 HWUSI-EAS548:7:1:5:893#0/1 TTTGATGATTTCATTTGATTCCATTCGTTAATGATT
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 HWUSI-EAS548:7:1:5:1527#0/1 CGGAGCTTGCAGTGAGCCGAGATCGCGCTACTGCAC
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 HWUSI-EAS548:7:1:4:331#0/1 ANTGAGATACCATCTCACGCCAGTCAGACTGGCAAT
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 HWUSI-EAS548:7:1:5:1098#0/1 CTTTGCATATTTTGGATAGACACCCAGAAGTGGAAT
IDs do not match
HWUSI-EAS548:7:1:5:68#0/1 HWUSI-EAS548:7:1:5:877#0/1 AGGCCAGCAGATCACCTGAGGTTGGGAGTTCGAGAC
IDs do not match
HWUSI-EAS548:7:1:5:377#0/1 TTCCTTTCAATCATTCCCTTTGATTCCATTCAAAGG HWUSI-EAS548:7:120:1791:900#0/1 ATTCCATTCCATTCCATTCCATTCCATTCCGTTCNG
IDs do not match
HWUSI-EAS548:7:1:5:377#0/1 TTCCTTTCAATCATTCCCTTTGATTCCATTCAAAGG HWUSI-EAS548:7:120:1791:800#0/1 TTTAAAAAAAAAAAAAAAAAAAAAAAAATAATTTNT
IDs do not match
HWUSI-EAS548:7:1:5:377#0/1 TTCCTTTCAATCATTCCCTTTGATTCCATTCAAAGG HWUSI-EAS548:7:1:5:68#0/1 AATTAGCCAGGTGTGGTGGCGCATGCCTGTAATCCC
IDs do not match
HWUSI-EAS548:7:1:5:377#0/1 HWUSI-EAS548:7:1:5:1530#0/1 TTCCTGTCGCGTCCATTCCATTCCATTTCACTCCAT
IDs do not match
HWUSI-EAS548:7:1:5:377#0/1 HWUSI-EAS548:7:1:4:12#0/1 CCCTGCGACTTGATNCCCTTAGCTGCTGAAGGACNC
IDs do not match
HWUSI-EAS548:7:1:5:377#0/1 HWUSI-EAS548:7:1:5:1566#0/1 CCACTGTGCTCCAGACTGCGTGACAGAGTGAGACTC
IDs do not match
HWUSI-EAS548:7:1:5:377#0/1 HWUSI-EAS548:7:1:5:893#0/1 TTTGATGATTTCATTTGATTCCATTCGTTAATGATT
Input file one:
HWUSI-EAS548:7:1:5:1527#0/1 + chr12 52084152 CGGAGCTTGCAGTGAGCCGAGATCGCGCTACTGCAC a`]``_``_```TTXF_[SXU^]aQ`][ZVZVPQ\\ 29
HWUSI-EAS548:7:1:5:1098#0/1 + chr1 245241196 CTTTGCATATTTTGGATAGACACCCAGAAGTGGAAT ababbabbbbababaaaaYV`baab^aa`WXPN^a` 0
HWUSI-EAS548:7:1:5:877#0/1 + chr13 40851377 AGGCCAGCAGATCACCTGAGGTTGGGAGTTCGAGAC a`aaaa^aaaa`b`a`ab^_XZ`a``\`^a`]aRL\ 1
HWUSI-EAS548:7:1:4:331#0/1 - chr13 91090676 ANTGAGATACCATCTCACGCCAGTCAGACTGGCAAT BB]RXZVVT]aYZYZYY``]Y]^]Y]_]`]`aa__a 0 34:T>N
HWUSI-EAS548:7:1:4:12#0/1 + chr4 100790527 CCCTGCGACTTGATNCCCTTAGCTGCTGAAGGACNC aaaaaa`\a```a`B`aa`]_Za]Y_]YQ[OOX`Ba 0 14:G>N,34:T>N
HWUSI-EAS548:7:1:5:1530#0/1 - chr10 42117291 TTCCTGTCGCGTCCATTCCATTCCATTTCACTCCAT XXSL]a][`aa^^aaa\_aa`]_`aaa`b^a]_aaa 0 30:T>G,31:A>T
HWUSI-EAS548:7:1:5:893#0/1 + chr16 44950626 TTTGATGATTTCATTTGATTCCATTCGTTAATGATT aabaYabaaa`aaa`aa`aaa_aa`aa_`Z]`]a`^ 1
HWUSI-EAS548:7:1:5:1566#0/1 - chr1 36436440 CCACTGTGCTCCAGACTGCGTGACAGAGTGAGACTC BBBBB\P_WR]YVa^]Z_`bba\^aaaabbab]ab` 0
HWUSI-EAS548:7:1:5:377#0/1 - chr16 44951483 TTCCTTTCAATCATTCCCTTTGATTCCATTCAAAGG Y]\W``Ya`Z[`^a[]``[U^a^`_[`aaab``aaa 1
HWUSI-EAS548:7:1:5:68#0/1 + chr2 68413664 AATTAGCCAGGTGTGGTGGCGCATGCCTGTAATCCC ``bab`bba`XPa[U[__]a`a_X^a`ZZTZOU^`_ 233
HWUSI-EAS548:7:120:1791:926#0/1 - chr4 48846414 ATTCCATTCCATTCCATTCCATTCCATTCCGTTCNG `YS^a`]X_``_W`ba^[aaa``bbbaabaaa`aBa 3 1:C>N
HWUSI-EAS548:7:120:1791:800#0/1 - chr8 13214240 AAAAAAAAAAAAAAAAAAAAAAAAAAAATAATTTNT aaaaaaaaaaa`aaaaaaaaaa`[`aaaT_[a_aB` 0 1:C>N
Input file two:
HWUSI-EAS548:7:1:5:1527#0/1 + chr12 52084152 CGGAGCTTGCAGTGAGCCGAGATCGCGCTACTGCAC a`]``_``_```TTXF_[SXU^]aQ`][ZVZVPQ\\ 29
HWUSI-EAS548:7:1:5:1098#0/1 + chr1 245241196 CTTTGCATATTTTGGATAGACACCCAGAAGTGGAAT ababbabbbbababaaaaYV`baab^aa`WXPN^a` 0
HWUSI-EAS548:7:1:5:877#0/1 + chr13 40851377 AGGCCAGCAGATCACCTGAGGTTGGGAGTTCGAGAC a`aaaa^aaaa`b`a`ab^_XZ`a``\`^a`]aRL\ 1
HWUSI-EAS548:7:1:4:331#0/1 - chr13 91090676 ANTGAGATACCATCTCACGCCAGTCAGACTGGCAAT BB]RXZVVT]aYZYZYY``]Y]^]Y]_]`]`aa__a 0 34:T>N
HWUSI-EAS548:7:1:4:12#0/1 + chr4 100790527 CCCTGCGACTTGATNCCCTTAGCTGCTGAAGGACNC aaaaaa`\a```a`B`aa`]_Za]Y_]YQ[OOX`Ba 0 14:G>N,34:T>N
HWUSI-EAS548:7:1:5:1530#0/1 - chr10 42117291 TTCCTGTCGCGTCCATTCCATTCCATTTCACTCCAT XXSL]a][`aa^^aaa\_aa`]_`aaa`b^a]_aaa 0 30:T>G,31:A>T
HWUSI-EAS548:7:1:5:893#0/1 + chr16 44950626 TTTGATGATTTCATTTGATTCCATTCGTTAATGATT aabaYabaaa`aaa`aa`aaa_aa`aa_`Z]`]a`^ 1
HWUSI-EAS548:7:1:5:1566#0/1 - chr1 36436440 CCACTGTGCTCCAGACTGCGTGACAGAGTGAGACTC BBBBB\P_WR]YVa^]Z_`bba\^aaaabbab]ab` 0
HWUSI-EAS548:7:1:5:377#0/1 - chr16 44951483 TTCCTTTCAATCATTCCCTTTGATTCCATTCAAAGG Y]\W``Ya`Z[`^a[]``[U^a^`_[`aaab``aaa 1
HWUSI-EAS548:7:1:5:68#0/1 + chr2 68413664 AATTAGCCAGGTGTGGTGGCGCATGCCTGTAATCCC ``bab`bba`XPa[U[__]a`a_X^a`ZZTZOU^`_ 233
HWUSI-EAS548:7:120:1791:900#0/1 - chr4 48846414 ATTCCATTCCATTCCATTCCATTCCATTCCGTTCNG `YS^a`]X_``_W`ba^[aaa``bbbaabaaa`aBa 3 1:C>N
HWUSI-EAS548:7:120:1791:800#0/1 - chr8 13214240 TTTAAAAAAAAAAAAAAAAAAAAAAAAATAATTTNT aaaaaaaaaaa`aaaaaaaaaa`[`aaaT_[a_aB` 0 1:C>N

Replies are listed 'Best First'.
Re: how to compare two hashes with perl?
by BioLion (Curate) on Nov 04, 2009 at 20:32 UTC

    It would help us a lot if you provided some example input and some sample output too (what you want and what you actually get).

    Do your IDs come in the same order in the two files? In that case there is no need to read in the whole of the first file; you can just compare the two files line by line.
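If the files really do list the same IDs in the same order, the line-by-line idea can be sketched roughly as below. This is only an illustration: `compare_in_lockstep` is a made-up helper name, the rows are invented, and it assumes every line has at least five whitespace-separated columns.

```perl
use strict;
use warnings;

# Hypothetical helper: walk two filehandles in step and collect the rows
# whose ID (column 1) or sequence (column 5) differ.
sub compare_in_lockstep {
    my ($fh1, $fh2) = @_;
    my @diffs;
    while (1) {
        my $line1 = <$fh1>;
        my $line2 = <$fh2>;
        # Stop as soon as either file runs out of lines.
        last unless defined $line1 and defined $line2;
        my ($id1, $seq1) = (split ' ', $line1)[0, 4];
        my ($id2, $seq2) = (split ' ', $line2)[0, 4];
        next if $id1 eq $id2 and $seq1 eq $seq2;
        push @diffs, join "\t", $id1, $seq1, $id2, $seq2;
    }
    return @diffs;
}

# Demo on in-memory filehandles with invented rows.
my $data1 = "id1 + chr1 100 AAAA\nid2 - chr2 200 CCCC\n";
my $data2 = "id1 + chr1 100 AAAA\nid2 - chr2 200 CCGC\n";
open(my $fh1, '<', \$data1) or die "Cannot open in-memory file: $!";
open(my $fh2, '<', \$data2) or die "Cannot open in-memory file: $!";
print "$_\n" for compare_in_lockstep($fh1, $fh2);
```

Only one line from each file is held in memory at a time, which matters for 3 GB inputs. It does not apply if the two files can be in different orders.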

    I would guess that your main source of inefficiency comes from how you are doing your comparisons:

    for my $ID1 (keys %bow1) {
        for my $ID2 (keys %bow2) {
            ...

    This will iterate through every ID in the first hash and compare it with every ID in the second hash! From the sound of it, you are only interested in comparing entries with matching IDs, so why not use a hash as it was intended and look up the appropriate ID?

    Also, I don't understand why you are storing the IDs twice, both as the key and as an entry in the value array ( $bow2{$ID2}[0] = $ID2; ).

    Lastly and very much OT, you should always check for success on filehandle operations:

    open(my $fh2, '<', $file2) or die "Failed to open $file2 for reading : $!";
    # ...do stuff...
    # use 'or' here: '||' binds tighter than close and would never see its result
    close($fh2) or die "Failed to close $file2 : $!";
    Just a something something...

      Thank you for your reply.

      The IDs might not be in the same order; that's why I'm looking for a certain ID from file 1 to match against any ID in file 2...

      This is what I wanted to check basically.

      1)Check ID names.

      2)If they match, and the sequences match, do not print.

      3)If they match, but the sequences do not match, print both ID and the sequences from each file.

      4)If they don't match, print both ID and the sequences from each file.

      I'm a newbie, and I'm trying to understand hashes... it's just confusing and I'm not exactly sure how my file gets stored in hash. I hear hash is random when it prints output, and I don't want my IDs to get mixed up with the wrong sequences (each ID corresponds to exactly one sequence).

      I updated the original post with my output and input files.

      Thank you!
        I hear hash is random when it prints output

        That just means that the order in which you add key/value pairs to a hash is not the order in which they are stored in the hash. Here is an example:

        use strict;
        use warnings;

        $\ = "\n";
        $, = ', ';

        my %hash = ();
        $hash{"h"} = 10;
        $hash{"z"} = 20;
        $hash{"a"} = 30;

        foreach my $key (keys %hash) {
            print "$key: $hash{$key}";
        }

        --output:--
        a: 30
        h: 10
        z: 20

        However, the key/value pairs are the same. A key will never be associated with a value that you did not enter for that key.
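When a repeatable order matters, for example to diff two runs, one common approach is to sort the keys before printing. A minimal sketch, where `sorted_pairs` is a made-up helper name used only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return "key: value" strings in sorted key order.
sub sorted_pairs {
    my (%hash) = @_;
    # keys() returns the keys in no particular order; sort makes it repeatable.
    return map { "$_: $hash{$_}" } sort keys %hash;
}

my %hash = (h => 10, z => 20, a => 30);
print "$_\n" for sorted_pairs(%hash);
```

Each key still maps to the value stored under it; sorting changes only the order in which the pairs are visited.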

        it's just confusing and I'm not exactly sure how my file gets stored in hash

        Take a look at this example:

        use strict;
        use warnings;

        $\ = "\n";
        $, = ', ';

        my %results = ();

        my $line = 'HWUSI-EAS548:7:1:5:1527#0/1 + chr12 52084152 CGGAGC';
        my @pieces = split /\s+/, $line;
        my $id = $pieces[0];
        my $seq = $pieces[-1];
        $results{$id} = $seq;

        foreach my $key (keys %results) {
            print "$key -----> $results{$key}";
        }

        --output:--
        HWUSI-EAS548:7:1:5:1527#0/1 -----> CGGAGC

        If you want to gather all the sequences corresponding to an id, you can do this:

        use strict;
        use warnings;

        $\ = "\n";
        $, = ', ';

        my %results = ();

        while (<DATA>) {
            my @pieces = split /\s+/;
            my $id = $pieces[0];
            my $seq = $pieces[-1];
            $results{$id} = [] unless exists $results{$id};
            push @{$results{$id}}, $seq;
        }

        foreach my $key (keys %results) {
            my $arr_str = join ',', @{$results{$key}};
            print "$key -----> [$arr_str]";
        }

        __DATA__
        HWUSI-EAS548:7:1:5:1527#0/1 + chr12 52084152 CGGAGC
        HWUSI-EAS548:7:1:5:1527#0/1 + chr12 52084152 XXXXXX
        Some_other_id + chr12 52084152 CGGAGC

        You might want to experiment a little more with hashes in a separate practice program. For instance, you might want to read perlintro and perldsc, which you can read by typing:

        $ man perlintro
        or
        $ man perldsc

        For a complete list of topics available type:

        $ man perl

        and scroll down.

        I take it this is bowtie output? It makes no sense to me why you are comparing all IDs in the first file to all IDs in the second? The whole point of using a hash is that you can look up specific keys, whereas an array would be for storing an ordered list.

        What are you actually trying to do? Get the common IDs between the files and say whether their associated sequences match? You can try something like this for that :

        foreach my $id (keys %hash1) { # you can use (sort keys %hash1) if you want them in a specified order
            if ( exists $hash2{$id} ) {
                print "\'$id\' exists in both hashes.\n";
                if ( $hash1{$id} eq $hash2{$id} ) { ## id and sequence are stored as key value pairs
                    print "and the sequences match too.\n";
                }
                else {
                    print "but the sequences do not match.\n";
                }
            }
            else {
                print "\'$id\' only exists in hash1.\n";
            }
        }

        If you want help with data structures, try perldsc for starters.

        Just a something something...
Re: how to compare two hashes with perl?
by Old_Gray_Bear (Bishop) on Nov 04, 2009 at 20:58 UTC
    You might take a look at Data::Compare on CPAN.

    Your code would reduce to the following Perlish pseudo-code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Data::Compare; # Note: you probably will have to install this from CPAN

    my (%bow1, %bow2);
    # Load the first hash (bow1)
    # Load the second hash (bow2)

    # Now the magic -- Data::Compare exports Compare(), which is true when the structures are equal
    my $rc = Compare(\%bow1, \%bow2);
    if ($rc) {
        print "Equality\n";
    }
    else {
        print "The hashes are not equal\n";
    }
    exit(0);
    (Coded but not tested. Your compilation may vary....)

    Look at the Synopsis of the Data::Compare module for examples of usage.

    ----
    I Go Back to Sleep, Now.

    OGB

      My two files might not be in the same order... and I'm only checking the first column and the 5th column. I'm not sure whether this module is applicable to my case.
Re: how to compare two hashes with perl?
by colwellj (Monk) on Nov 04, 2009 at 23:44 UTC
    Try this:
    #!/usr/bin/perl
    use warnings;
    use strict;

    my %bow1 = ();
    my $file1 = shift;
    open (FILE1, "$file1") or die "could not open FILE1:$file1"; # Open first file
    while (<FILE1>) {
        my ($ID1, undef, undef, undef, $Seq1) = split;
        $bow1{$ID1} = $ID1; # your id is the key and can be got again
    }
    close FILE1;

    my $file2 = shift;
    open (FILE2, "$file2") or die "could not open FILE2:$file2"; # Open second file
    while (<FILE2>) {
        my ($ID2, undef, undef, undef, $Seq2) = split;
        if (defined($bow1{$ID2})) { # IDs match
            if ($bow1{$ID2} eq $Seq2) {
                # both match so remove from the hash to avoid looking for a correct one later
                delete $bow1{$ID2};
            } else {
                # sequences don't match
                print "IDs exist but don't match ID:$ID2:Seq1:$bow1{$ID2}:Seq2:$Seq2:\n";
                # already reported, so remove it; otherwise it is listed again as "only in file 1"
                delete $bow1{$ID2};
            }
        } else {
            # id not in list 1
            print "ID only in file 2 ID:$ID2:$Seq2\n";
        }
    }
    close FILE2;

    # run through any remaining list one items that weren't matched
    while ((my $key, my $value) = each %bow1) {
        print "ID only in file 1 ID:$key:Seq:$value:\n";
    }
      I think that this part in your code:
      while (<FILE1>) {
          my ($ID1, undef, undef, undef, $Seq1) = split;
          $bow1{$ID1} = $ID1; # your id is the key and can be got again
      }
      should actually be like this:
      while (<FILE1>) {
          my ($ID1, undef, undef, undef, $Seq1) = split;
          $bow1{$ID1} = $Seq1; # ID is hash key, Seq is hash value
      }
        oops, well spotted.
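Putting the thread's suggestions together, the comparison could be sketched roughly as follows. This is an illustration under assumptions, not a tested replacement for anyone's script: `load_sequences` and `compare_sequences` are made-up helper names, the rows are invented, and each hash maps an ID to its sequence. Sequences are compared with `ne`; the `=!` in the original script is an assignment of a negated value (`= !`), not a comparison, so it overwrites the stored sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: read column 1 (ID) and column 5 (sequence) into a hash.
sub load_sequences {
    my ($fh) = @_;
    my %seq_for;
    while (my $line = <$fh>) {
        my ($id, $seq) = (split ' ', $line)[0, 4];
        next unless defined $id and defined $seq;   # skip blank or short lines
        $seq_for{$id} = $seq;
    }
    return \%seq_for;
}

# Hypothetical helper: report IDs whose sequences differ, plus IDs that
# appear in only one of the two hashes. Matching pairs produce no output.
sub compare_sequences {
    my ($first, $second) = @_;
    my @report;
    for my $id (sort keys %$first) {
        if (!exists $second->{$id}) {
            push @report, "only in file 1\t$id\t$first->{$id}";
        }
        elsif ($first->{$id} ne $second->{$id}) {   # 'ne' is string inequality
            push @report, "sequences differ\t$id\t$first->{$id}\t$second->{$id}";
        }
    }
    for my $id (sort keys %$second) {
        push @report, "only in file 2\t$id\t$second->{$id}"
            unless exists $first->{$id};
    }
    return @report;
}

# Demo on in-memory filehandles with invented rows in different orders.
my $file1 = "r1 + chr1 10 AAAA\nr2 - chr2 20 CCCC\nr3 + chr3 30 GGGG\n";
my $file2 = "r2 - chr2 20 CCTC\nr1 + chr1 10 AAAA\nr4 + chr4 40 TTTT\n";
open(my $in1, '<', \$file1) or die "Cannot open in-memory file: $!";
open(my $in2, '<', \$file2) or die "Cannot open in-memory file: $!";
print "$_\n" for compare_sequences(load_sequences($in1), load_sequences($in2));
```

Each ID is looked up directly, so the work grows with the number of lines rather than with the product of the two file sizes. For very large inputs, streaming the second file against a hash of the first, as in the reply above, avoids holding both files in memory.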
