Hi monks,
I have two files. I need to compare the contents of specific columns between them and print the results. It sounds simple, but I am running into trouble when I code it. Each file is about 500 MB, and the two files do not have the same number of lines. The files look like this:
sample file1:
ast578 61218 15755 + C/A
ast579 61218 15856 - A/G/T
ast580 61218 65798 + T/C
ast581 61218 67955 - A/TTA/AT
ast582 61218 68625 + -/G/AT
ast599 61218 68025 + (AT)12/34/32
sample file2:
61218 15755 A > T 0 4 0 0 2 7
61218 15856 T > C 3 2 0 3 1 8
61218 42547 A > G 3 7 5 4 1 6
61218 67955 A > G 0 10 0 9 3 4
61218 68625 G > A 0 10 0 8 1 5
Here, the 2nd, 3rd, and last columns of file1 are compared with the 1st, 2nd, and 3rd columns of file2.
If the single letter before the '>' symbol in file2 matches one of the '/'-separated entries in file1, the file2 line is reported as a hit and printed in the result. A hit is reported only if the matching file1 entry is a single letter; otherwise it is not. The 1st column of file1 is appended to each file2 hit line in the output. The output file looks like this:
61218 15755 A > T 0 4 0 0 2 7 ast578
61218 15856 T > C 3 2 0 3 1 8 ast579
61218 67955 A > G 0 10 0 9 3 4 ast581
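The matching rule described above can be sketched as a small Perl predicate. This is only one reading of the rule, not a definitive implementation: the sub names `ref_base` and `is_hit` are made up for illustration, and the third column of file2 is assumed to look like `A > T`. Because the letter before '>' is a single character, only a single-letter entry in file1 can compare equal to it.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the single letter before '>' from a
# file2 field such as "A > T". Returns undef for any other shape.
sub ref_base {
    my ($change) = @_;
    return $change =~ /^\s*([A-Za-z])\s*>/ ? $1 : undef;
}

# Hypothetical helper: true if $base equals one of the '/'-separated
# entries of a file1 field such as "A/TTA/AT".
sub is_hit {
    my ($base, $alleles) = @_;
    return 0 unless defined $base;
    return (grep { $_ eq $base } split m{/}, $alleles) ? 1 : 0;
}

# Pairs taken from the sample lines above: [file2 change, file1 entries].
for my $pair (['A > T', 'C/A'], ['T > C', 'A/G/T'], ['A > G', 'A/TTA/AT']) {
    my ($change, $alleles) = @$pair;
    printf "%-6s vs %-9s => %s\n", $change, $alleles,
        is_hit(ref_base($change), $alleles) ? 'hit' : 'no hit';
}
```

Under this literal reading, the sample pair `G > A` and `-/G/AT` (position 68625) would also count as a hit, although the sample output does not list it, so the intended rule may include a further condition that is not stated here.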
The program I have written so far is:
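#!/usr/bin/perl
use warnings;
use strict;

open(FH,  "<", "file1.txt")  or die "can not open file1.txt: $!";
open(FH1, "<", "file2.txt")  or die "can not open file2.txt: $!";
open(OUT, ">", "result.out") or die "can not create result.out: $!";

# file1: key "chrom#position" => entry list such as "C/A"
my @file1;
while (my $line1 = <FH>) {
    chomp $line1;    # otherwise the last entry keeps its newline
    my @list1 = split("\t", $line1);
    push(@file1, $list1[1] . "#" . $list1[2], $list1[4]);
}
my %hash1 = @file1;

# file2: key "chrom#position" => change such as "A > T"
my @file2;
while (my $line2 = <FH1>) {
    chomp $line2;
    my @list2 = split("\t", $line2);
    push(@file2, $list2[0] . "#" . $list2[1], $list2[2]);
}
my %hash2 = @file2;

my @allhits;
# Nested scan: every key of %hash1 is compared with every key of %hash2.
while (my ($key1, $value1) = each(%hash1)) {
    while (my ($key2, $value2) = each(%hash2)) {
        if ($key1 eq $key2) {
            $value2 =~ s/\s//g;
            my @val1 = split("/", $value1);
            my @val2 = split(">", $value2);
            foreach (@val1) {
                # $val2[0] is the letter before '>'
                if ($_ eq $val2[0]) {
                    print "$key1\n";
                    #push(@allhits,$key1);
                }
            }
        }
    }
}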
=pod
foreach (@allhits) {
    while (my $str = <FH1>) {
        my ($chrom, $position, $var, $one, $two, $three, $four, $five, $six) = split("\t", $str);
        my ($id, $location) = split("#", $_);
        if (($chrom == $id) && ($position == $location)) {
            print "$_";
        }
    }
}
=cut
The program takes a very long time to run. Can anyone please suggest a simpler way? :( I would be really thankful for any help on this :)
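One common way to avoid comparing every key of one file with every key of the other is to index file1 in a hash keyed on chromosome and position, then read file2 once and look up each line directly. The following is a minimal sketch under stated assumptions, not a solution verified on the real data: fields are tab-separated, file1 has at most one line per chromosome/position pair, the third column of file2 looks like `A > T`, and the sub names `build_index` and `report_hits` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read file1 and map "chrom\tposition" => "id\tentries".
# Assumes tab-separated columns: id, chrom, position, strand, entries.
sub build_index {
    my ($fh) = @_;
    my %index;
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $chrom, $pos, undef, $alleles) = split /\t/, $line;
        next unless defined $alleles;           # skip short or blank lines
        $index{"$chrom\t$pos"} = "$id\t$alleles";
    }
    return \%index;
}

# Stream file2 once; print each hit line with the file1 id appended.
sub report_hits {
    my ($index, $in, $out) = @_;
    while (my $line = <$in>) {
        chomp $line;
        my ($chrom, $pos, $change) = split /\t/, $line;
        next unless defined $change;
        my $entry = $index->{"$chrom\t$pos"} or next;    # one hash lookup
        my ($base) = $change =~ /^\s*([A-Za-z])\s*>/ or next;
        my ($id, $alleles) = split /\t/, $entry;
        # A single letter can only equal a single-letter entry.
        print {$out} "$line\t$id\n"
            if grep { $_ eq $base } split m{/}, $alleles;
    }
    return;
}

# Run only when both file names are given on the command line.
if (@ARGV == 2) {
    open my $fh1, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!";
    open my $fh2, '<', $ARGV[1] or die "cannot open $ARGV[1]: $!";
    report_hits(build_index($fh1), $fh2, \*STDOUT);
}
```

Saved under a hypothetical name such as `compare.pl`, it could be run as `perl compare.pl file1.txt file2.txt > result.out`. Each file is read once, so the work grows with the total number of lines rather than with the product of the two line counts. The index keeps one entry per file1 line in memory, which may be substantial for a 500 MB file.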