PerlMonks  

Re: duplicates getting omitted while comparing values inside foreach.

by ambrus (Abbot)
on Apr 16, 2009 at 08:51 UTC (#757900)


in reply to duplicates getting omitted while comparing values inside foreach.

And to prove that what the join utility does (see solution above) is not trivial but also not very complicated, let's try to emulate it in a short Perl program.

perl -we '
    open $_, "<", shift or die for my($D, $Q);
    my $p = -1e9999; my $s = -1e9999; my $x;
    while (my $q = <$Q>) {
        chomp($q);
        my $k = (split " ", $q)[1];
        $p <= $k or die "query not sorted";
        $p = $k;
        while ($s < $p and defined(my $d = <$D>)) {
            (my($l), $x) = (split " ", $d)[2,0];
            $s <= $l or die "db not sorted";
            $s = $l;
        }
        my $z = $p == $s ? $x : "-";
        print $q, " ", $z, "\n";
    }
' file1.db file2.query

The output is this.

1190 31277 A > T 1 0 0 -
1190 31607 C > A 0 3 1 -
1190 31629 C > T 0 2 0 -
1190 31789 A > G 1 2 5 zm1829427
1190 31882 A > C 0 4 0 -
1190 31883 T > A 0 4 0 zm445312
1190 31883 T > C 2 2 5 zm445312
1190 32199 C > T 0 1 1 -
1190 32487 T > C 0 1 1 -
1190 32496 A > G 0 3 0 -

Adding the "Number of HITS" message is left as an exercise for the reader.
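For what it's worth, here is one way the exercise might go. This is only a sketch: it does the same sorted merge as the one-liner, but on two array refs of lines instead of file handles, and it keeps a counter. The names merge_join and $hits are made up for illustration, and the field positions (key in field 2 of the query, key in field 3 and id in field 1 of the db) are simply taken from the one-liner.

```perl
use strict;
use warnings;

# Merge-join two lists of lines that are already sorted numerically by key.
# Assumed layout (taken from the one-liner): the key is the 2nd field of a
# query line and the 3rd field of a db line; the db id is the 1st field.
sub merge_join {
    my ($db_lines, $query_lines) = @_;
    my @out;
    my $hits = 0;
    my $i = 0;                                  # index of the next unread db line
    my ($db_key, $db_id) = (-9**9**9, undef);   # start below every key (-inf)
    for my $q (@$query_lines) {
        my $key = (split " ", $q)[1];
        # Advance through the db until its key catches up with the query key.
        while ($db_key < $key and $i < @$db_lines) {
            ($db_key, $db_id) = (split " ", $db_lines->[$i++])[2, 0];
        }
        my $match = $db_key == $key;
        $hits++ if $match;
        push @out, "$q " . ($match ? $db_id : "-");
    }
    return (\@out, $hits);
}
```

Printing the joined lines followed by "Number of HITS: $hits" then gives the message. Note that this counts every query line whose key is found in the db, so a key that appears twice in the query counts twice.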

Re^2: duplicates getting omitted while comparing values inside foreach.
by patric (Acolyte) on Apr 16, 2009 at 09:31 UTC
    This is correct... so instead of 3 hits, it should be 1. The final result should look like:
    1190 31277 A > T 1 0 0 -
    1190 31607 C > A 0 3 1 -
    1190 31629 C > T 0 2 0 -
    1190 31789 A > G 1 2 5 zm1829427
    1190 31882 A > C 0 4 0 -
    1190 31883 T > A 0 4 0 -
    1190 31883 T > C 2 2 5 -
    1190 32199 C > T 0 1 1 -
    1190 32487 T > C 0 1 1 -
    1190 32496 A > G 0 3 0 -
    Total number of Hits: 1
    Sorry for the confusion in between :(

      Please explain why this should be the answer. What's your criterion for joining actually?

        Please excuse me for my fast/paranoid decisions. I get confused easily when I face little problems, and I confuse other people too... I am sorry for that. I have to build up some patience. hehehehe :)
        Well, the criterion is to match the letters. The letters seen in the 3rd column of file2.query should be checked against the 5th (last) column of file1.db. In this case, the only common columns are the 2nd and 3rd columns of file1 and the 1st and 2nd columns of file2. With this as the basis, I compare the letters in both entries (files). For every line in file2, if both letters (the two separated by '>') are found among the letters on the corresponding line of file1 (separated by '/'; there can be many, not just 2), it is reported as a HIT. In the output I needed
        1190 31789 A > G 1 2 5 zm1829427
        Only in this line does the "A>G" of file2 match the "A/G" of file1. The rest are not considered hits because the letters of file2 are not present in file1. My final decision: the output is:
        1190 31277 A > T 1 0 0 -
        1190 31607 C > A 0 3 1 -
        1190 31629 C > T 0 2 0 -
        1190 31789 A > G 1 2 5 zm1829427
        1190 31882 A > C 0 4 0 -
        1190 31883 T > A 0 4 0 -
        1190 31883 T > C 2 2 5 -
        1190 32199 C > T 0 1 1 -
        1190 32487 T > C 0 1 1 -
        1190 32496 A > G 0 3 0 -
        Total number of HITS: 1
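        The criterion described above could be sketched as a small predicate. This is an illustration only, not the code from the thread: the name is_hit is hypothetical, and it assumes the caller has already pulled out the two letters around '>' from the file2 line and the '/'-separated last column from the matching file1 line.

```perl
use strict;
use warnings;

# Return 1 if both query letters occur in the "/"-separated db letter list,
# 0 otherwise. $from and $to are the letters left and right of ">".
sub is_hit {
    my ($from, $to, $db_letters) = @_;
    my %seen = map { $_ => 1 } split m{/}, $db_letters;
    return ($seen{$from} && $seen{$to}) ? 1 : 0;
}
```

        With this, is_hit("A", "G", "A/G") is true, while a line whose db entry lacks either letter is not a hit even though the positions match. Building the lookup hash fresh on each call also means no state is carried from one query line to the next.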
Re^2: duplicates getting omitted while comparing values inside foreach.
by patric (Acolyte) on Apr 16, 2009 at 09:07 UTC
    Sorry for bothering you all. I made a silly mistake and I have rectified it :) Thanks for all your suggestions; now I also have other methods to look at.
    The correction I made was:
    if (defined($variant)) {
        foreach my $lis (@snplist) {
            if ($queryinfo[2] eq $lis) { $flag_left = 1; }
            elsif ($queryinfo[3] eq $lis) { $flag_right = 1; }
        }
        if (($flag_left == 1) && ($flag_right == 1)) {
            print OUT "$_\t$rs\n";
            $c++;
            $flag_left = 0; $flag_right = 0;
        }
        else {
            print OUT "$_\t-\n";
        }
    }
    else {
        print OUT "$_\t-\n";
    }
    I just had to add an else branch under the if... that's all :)
Re^2: duplicates getting omitted while comparing values inside foreach.
by patric (Acolyte) on Apr 16, 2009 at 09:15 UTC
    Sorry to say this, but in
    1190 31883 T > A 0 4 0 zm445312
    1190 31883 T > C 2 2 5 zm445312
    the code "zm445312" doesn't count as a hit. Only "zm1829427" has to be reported.
Re^2: duplicates getting omitted while comparing values inside foreach.
by patric (Acolyte) on Apr 16, 2009 at 09:17 UTC
    The number of hits I should get is 1, not 3.
