Dear all, I am having a weird problem. I am comparing a specific column between two large files: file1 is a database text file and file2 is a query text file. I want to compare a particular column of file2 (which holds letters) against file1. The final result should contain every line of file2, with one extra column added from file1. For example:

file1.db:
zm811463    1190    31050    -    A/G/T/C
zm811462    1190    31051    -    C/T
zm1829427    1190    31789    +    A/G
zm445312    1190    31883    -    A/G
zm5377419    1190    32419    +    A/C
zm1052506    1190    32829    +    C/G
zm1052507    1190    32886    +    A/C/T
zm9115338    1190    33832    +    A/G/CC

file2.query
1190    31277    A > T    1    0    0
1190    31607    C > A    0    3    1
1190    31629    C > T    0    2    0
1190    31789    A > G    1    2    5
1190    31882    A > C    0    4    0
1190    31883    T > A    0    4    0
1190    31883    T > C    2    2    5
1190    32199    C > T    0    1    1
1190    32487    T > C    0    1    1
1190    32496    A > G    0    3    0

Output I am getting now:
1190    31277    A > T    1    0    0    -
1190    31607    C > A    0    3    1    -
1190    31629    C > T    0    2    0    -
1190    31789    A > G    1    2    5    zm1829427
1190    31882    A > C    0    4    0    -
1190    32199    C > T    0    1    1    -
1190    32487    T > C    0    1    1    -
1190    32496    A > G    0    3    0    -

Total number of HITS: 1

But I want my output to look like:

1190    31277    A > T    1    0    0    -
1190    31607    C > A    0    3    1    -
1190    31629    C > T    0    2    0    -
1190    31789    A > G    1    2    5    zm1829427
1190    31882    A > C    0    4    0    -
1190    31883    T > A    0    4    0    -
1190    31883    T > C    2    2    5    -
1190    32199    C > T    0    1    1    -
1190    32487    T > C    0    1    1    -
1190    32496    A > G    0    3    0    -

Total number of HITS: 1

If you look closely, these two lines are missing from my output:

1190    31883    T > A    0    4    0       -
1190    31883    T > C    2    2    5       -

Why is this happening? The program so far looks like this:

use strict;
use warnings;
use DB_File;


my $myhashfile = "hash.$$";
tie my %hash1, "DB_File", $myhashfile, O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "cannot open file $myhashfile: $!";


open(OUT,">output.out")or die "can not open";
open(my $fh1, "<", "file1.db") or die "file1.db: $!";


foreach (<$fh1>){
    chomp;
    my @dbinfo = split(/\s+/);
    $hash1{"$dbinfo[1]#$dbinfo[2]"} = "$dbinfo[4]##$dbinfo[0]";
}
close($fh1);


my $c=0;
open(my $fh2, "<", "file2.query") or die "file2.query: $!";
my @snplist;
foreach (<$fh2>) {
    chomp($_);
    @snplist = ();

    my @queryinfo = split(/[\s>]+/);
    my $values = $hash1{"$queryinfo[0]#$queryinfo[1]"};
    my ($variant, $rs) = split("##", $values);

    my $flag_left = 0;
    my $flag_right = 0;
    @snplist = split("/", $variant);

    if (defined($variant)) {
        foreach my $lis (@snplist) {
            if ($queryinfo[2] eq $lis) { $flag_left = 1; }
            elsif ($queryinfo[3] eq $lis) { $flag_right = 1; }
        }

        if (($flag_left == 1) && ($flag_right == 1)) {
            print OUT "$_\t$rs\n";
            $c++;
            $flag_left = 0;
            $flag_right = 0;
        }
    }
    else {
        print OUT "$_\t-\n";
    }
}


print OUT "\nTotal number of HITS: $c\n";
close($fh2);

untie %hash1;
unlink($myhashfile);
$myhashfile=();

I am using a tied hash because I am dealing with large files. I use @snplist because there can be more letters to compare: a line counts as a hit only if both letters in the third column of file2 (separated by ">") are present in the fifth column of file1 (which holds several letters separated by "/"). Please help. Thank you very much.
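
The hit rule described above can be sketched as a small lookup routine. This is only an illustration under stated assumptions: `annotate` is a hypothetical helper that is not part of the script, and a plain in-memory hash stands in for the DB_File tie so the example runs on its own. It keeps the same `chr#pos` key and `alleles##id` value layout as the script. The sketch returns "-" both when the position is absent from file1 and when the position is present but the two letters do not both match. In the script as posted, that second case reaches neither `print`, which would be consistent with the two 31883 lines disappearing.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the file1 ID
# when BOTH query letters occur in the "/"-separated list stored for that
# position, and "-" in every other case.
sub annotate {
    my ($db, $chr, $pos, $from, $to) = @_;

    my $entry = $db->{"$chr#$pos"};
    return '-' unless defined $entry;      # position not present in file1

    my ($variants, $id) = split /##/, $entry;
    my %allele = map { $_ => 1 } split m{/}, $variants;

    # A hit needs both letters; anything else still yields a "-" column.
    return ($allele{$from} && $allele{$to}) ? $id : '-';
}

# Assumption: a plain hash replaces the tied DB_File hash for this demo.
my %db = (
    '1190#31789' => 'A/G##zm1829427',
    '1190#31883' => 'A/G##zm445312',
);

my @queries = (
    "1190\t31277\tA > T\t1\t0\t0",   # position missing from file1
    "1190\t31789\tA > G\t1\t2\t5",   # both letters present: a hit
    "1190\t31883\tT > A\t0\t4\t0",   # position present, T not in A/G
);

my $hits = 0;
for my $line (@queries) {
    my ($chr, $pos, $from, $to) = split /[\s>]+/, $line;
    my $tag = annotate(\%db, $chr, $pos, $from, $to);
    $hits++ if $tag ne '-';
    print "$line\t$tag\n";               # every query line is printed once
}
print "\nTotal number of HITS: $hits\n";
```

Because the decision is made in one place and always returns a value, every line of file2 gets printed exactly once, whether or not it is a hit. The same routine should work unchanged with a reference to the tied hash (`\%hash1`), though that has not been tested here against the real files.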

In reply to duplicates getting omitted while comparing values inside foreach. by patric
