match two files

yueli711 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I wrote a Perl script to match two files, but when the input file is very large it takes a very long time to run. How can I shorten the running time by changing some code? Thanks in advance for any help! Best, Yue

open(IN1,"tmp12") || die "Cannot open this file";
@lines1 = <IN1>;

open(IN2,"donor_82_01.csv") || die "Cannot open this file";
@lines2 = <IN2>;

open(OUT,">tmp12_01") || die "Cannot open this file";

for $item1(@lines1){
    chomp $item1;
    #print OUT $item1,"\t";
@tmp1=split(/\t+/, $item1);

for $item2(@lines2){
    chomp $item2;
@tmp2=split(/\,+/, $item2);
        if ($tmp1[1] eq $tmp2[0]){
            print OUT $tmp1[0],",",$item2;
            last;
        }
    $i++
}

print OUT "\n";

}

close(IN1);
close(IN2);
close(OUT);

The contents of tmp12 are:


A1BG    ENSG00000121410
A1BG-AS1    ENSG00000268895
A1CF    ENSG00000148584
A2M    ENSG00000175899
A2M-AS1    ENSG00000245105
A2ML1    ENSG00000166535
A2ML1-AS1    ENSG00000256661
A2ML1-AS2    ENSG00000256904
A3GALT2    ENSG00000184389
A4GALT    ENSG00000128274
A4GNT    ENSG00000118017
AAAS    ENSG00000094914
AACS    ENSG00000081760
AADAC    ENSG00000114771
AADACL2    ENSG00000197953
AADACL2-AS1    ENSG00000242908
AADACL3    ENSG00000188984
AADACL4    ENSG00000204518
AADAT    ENSG00000109576
AAGAB    ENSG00000103591
AAK1    ENSG00000115977
AAMDC    ENSG00000087884
AAMP    ENSG00000127837
AANAT    ENSG00000129673
AAR2    ENSG00000131043
AARD    ENSG00000205002
AARS1    ENSG00000090861
AARS2    ENSG00000124608
AARSD1    ENSG00000266967
AASDH    ENSG00000157426
AASDHPPT    ENSG00000149313
AASS    ENSG00000008311
AATBC    ENSG00000215458
AATF    ENSG00000275700
AATK    ENSG00000181409
ABALON    ENSG00000281376
ABAT    ENSG00000183044
ABCA1    ENSG00000165029
ABCA10    ENSG00000154263
ABCA12    ENSG00000144452
ABCA13    ENSG00000179869
ABCA2    ENSG00000107331
ABCA3    ENSG00000167972
ABCA4    ENSG00000198691
ABCA5    ENSG00000154265
ABCA6    ENSG00000154262
ABCA7    ENSG00000064687
ABCA8    ENSG00000141338
ABCA9    ENSG00000154258

The contents of donor_82_01.csv are:

,AAACCTGAGCGTTTAC-1,AAACCTGAGTCGCCGT-1,AAACCTGGTAGGACAC-1,AAACCTGGTGCCTTGG-1,AAACCTGGTTCAGCGC-1
ENSG00000148584,0,0,0,0,0
ENSG00000237613,0,0,0,0,0
ENSG00000186092,0,0,0,0,0
ENSG00000118017,0,0,0,0,0
ENSG00000239945,0,0,0,0,0
ENSG00000205002,0,0,0,0,0
ENSG00000090861,0,0,0,0,0
ENSG00000279928,0,0,0,0,0
ENSG00000181409,0,1,0,1,0
ENSG00000228463,0,0,0,0,0
ENSG00000236743,0,0,0,0,0
ENSG00000165029,0,0,0,0,0
ENSG00000144452,0,0,0,0,0
ENSG00000278566,0,0,0,0,0
ENSG00000179869,0,0,0,0,0
ENSG00000235146,0,0,0,0,0
ENSG00000154262,0,0,0,0,0
ENSG00000141338,0,0,0,0,0
ENSG00000154258,0,0,0,0,0


Replies are listed 'Best First'.
Re: match two files by Corion (Patriarch) on Jun 03, 2020 at 09:11 UTC
This is a FAQ. See perlfaq4, "How do I compute the intersection of two arrays?". Your code is slow because for every item in `@lines1` it looks at all items in `@lines2`. If you precompute a lookup table ("hash", in Perl data structures) for the items in `@lines2`, you can find the items in `@lines2` much faster.
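The lookup-table idea can be sketched as follows. This is a minimal illustration, not the poster's code: the sub names `build_lookup` and `match_lines` are made up for the example, the input is passed as in-memory lines rather than read from files, and unmatched lines are skipped (the original script prints an empty line for them).

```perl
use strict;
use warnings;

# Build a lookup table from CSV lines: first field => whole line.
# If an ID occurs more than once, the first occurrence is kept.
sub build_lookup {
    my @csv_lines = @_;
    my %row_for;
    for my $line (@csv_lines) {
        chomp $line;
        my ($id) = split /,/, $line, 2;
        next unless defined $id && length $id;    # skips blank lines and the header row
        $row_for{$id} //= $line;
    }
    return \%row_for;
}

# For each "SYMBOL <whitespace> ID" line, emit "SYMBOL,<csv row>" when the ID is known.
sub match_lines {
    my ( $row_for, @tab_lines ) = @_;
    my @out;
    for my $line (@tab_lines) {
        my ( $symbol, $id ) = split ' ', $line;
        next unless defined $id && exists $row_for->{$id};
        push @out, "$symbol,$row_for->{$id}";
    }
    return @out;
}

my $row_for = build_lookup(
    ",AAACCTGAGCGTTTAC-1,AAACCTGAGTCGCCGT-1\n",
    "ENSG00000148584,0,0\n",
    "ENSG00000181409,0,1\n",
);
print "$_\n" for match_lines(
    $row_for,
    "A1BG\tENSG00000121410\n",
    "A1CF\tENSG00000148584\n",
    "AATK\tENSG00000181409\n",
);
```

Each line of the second file is split once while building the hash, and each line of the first file costs a single hash lookup instead of a scan over the whole second file.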
Re: match two files by hippo (Bishop) on Jun 03, 2020 at 09:21 UTC
"How can I shorten the running time by changing some code?"

Although it's hard to spot because of the random indenting, you have a pair of nested loops. Inside the inner loop you have the line `$i++`, which serves absolutely no purpose. The first change you should make is therefore to remove this line.

Then you might look at your algorithm. Why are you doing the same processing on the entries in `@lines2` over and over again? Just process it once, pop the results in a hash for fast lookup and your code will whizz.

Three more tips:

- `use strict`
- `use warnings`
- Pick an indentation scheme and stick to it. perltidy can help to enforce this.

Good luck.
Re: match two files by jwkrahn (Abbot) on Jun 03, 2020 at 12:32 UTC
This will probably shorten the running time, but I don't have your data to test it on, so good luck.

#!/usr/bin/perl
use warnings;
use strict;

use Fcntl ':seek';

open my $CSV, '<', 'donor_82_01.csv' or die "Cannot open 'donor_82_01.csv' because: $!";

my $pos = tell $CSV;
my %csv_data;
while ( <$CSV> ) {
    my ( $first ) = split /,+/;
    push @{ $csv_data{ $first } }, $pos;
    $pos = tell $CSV;
}

open my $TAB, '<', 'tmp12' or die "Cannot open 'tmp12' because: $!";
open my $OUT, '>', 'tmp12_02' or die "Cannot open 'tmp12_02' because: $!";

while ( <$TAB> ) {
    my ( $first, $second ) = split /\t+/;
    next unless exists $csv_data{ $second };
    for my $pos ( @{ $csv_data{ $second } } ) {
        seek $CSV, $pos, SEEK_SET or die "Cannot seek on 'donor_82_01.csv' because: $!";
        print $OUT "$first,", scalar <$CSV>;
    }
}

close $CSV;
close $TAB;
close $OUT;
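The code above stores byte offsets rather than whole lines, so only the IDs and positions are held in memory. A small sketch of that offset-index technique, assuming an in-memory filehandle as a stand-in for the CSV file and a hypothetical helper `fetch_row`:

```perl
use strict;
use warnings;
use Fcntl ':seek';

# Stand-in for the CSV file: an in-memory filehandle with sample rows.
my $csv_text = "ENSG00000148584,0,0\nENSG00000181409,0,1\n";
open my $csv, '<', \$csv_text or die "Cannot open in-memory file: $!";

# Index pass: record the byte offset at which each line starts.
my %offset_for;
my $pos = tell $csv;
while ( my $line = <$csv> ) {
    my ($id) = split /,/, $line, 2;
    $offset_for{$id} = $pos;
    $pos = tell $csv;
}

# Lookup: jump straight to a stored offset and re-read that one line.
sub fetch_row {
    my ($id) = @_;
    return unless exists $offset_for{$id};
    seek $csv, $offset_for{$id}, SEEK_SET or die "Cannot seek: $!";
    my $row = <$csv>;
    chomp $row;
    return $row;
}

print fetch_row('ENSG00000181409'), "\n";
```

The trade-off is an extra seek and read per match in exchange for a much smaller hash, which matters mainly when the CSV rows are long.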
Re^2: match two files by yueli711 (Sexton) on Jun 04, 2020 at 05:02 UTC
Hello jwkrahn, Thank you so much for your useful code! Thank you again and really appreciated!

li@li-HP-$ perl match12.pl
Use of uninitialized value $second in exists at match12.pl line 25, <$TAB> line 1.
Use of uninitialized value $second in hash element at match12.pl line 26, <$TAB> line 1.
Re^3: match two files by jwkrahn (Abbot) on Jun 04, 2020 at 18:26 UTC
Hi! To get rid of the warning messages, change the line `my ( $first, $second ) = split /\t+/;` to this: `my ( $first, $second ) = split or next;`
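Why the bare `split` helps can be seen in a short sketch. It assumes the columns in tmp12 are separated by spaces rather than tabs (the listing in the question looks that way, but that is a guess), and `parse_pairs` is a hypothetical helper name:

```perl
use strict;
use warnings;

# Parse "SYMBOL <whitespace> ID" lines into [symbol, id] pairs.
sub parse_pairs {
    my @pairs;
    for (@_) {
        # A bare split means split ' ', $_: fields are separated by any run of
        # whitespace, and the trailing newline never ends up in a field.
        # A blank line yields zero fields, so "or next" skips it.
        my ( $first, $second ) = split or next;
        push @pairs, [ $first, $second ];
    }
    return @pairs;
}

my @pairs = parse_pairs(
    "\n",
    "A1BG    ENSG00000121410\n",
    "A1CF\tENSG00000148584\n",
);
print "$_->[0] => $_->[1]\n" for @pairs;

# For contrast: splitting the space-separated line on tabs yields one field only,
# which is what leaves the second variable undefined.
my @fields = split /\t+/, "A1BG    ENSG00000121410\n";
print "fields from tab split: ", scalar @fields, "\n";
```

Note that a line holding only one word still passes the `or next` test, so the second variable can remain undefined for such lines.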
Re^4: match two files by yueli711 (Sexton) on Jun 05, 2020 at 02:09 UTC
Re^5: match two files by jwkrahn (Abbot) on Jun 05, 2020 at 23:44 UTC
Some notes below your chosen depth have not been shown here
Re: match two files by perlfan (Vicar) on Jun 03, 2020 at 13:07 UTC
Here's how I'd do it (for clarity, this was basically suggested in the first reply). Code untested:

use strict;
use warnings;
use Tie::Hash::Indexed;

tie my %lines1, 'Tie::Hash::Indexed';    # gives you the ordered hash

open my $IN1, '<', "tmp12"           or die "Cannot open this file: $!";
open my $IN2, '<', "donor_82_01.csv" or die "Cannot open this file: $!";

# step 1, cache contents of $IN1 (read the first file once)
# populate %lines1 "cache"
for my $item1 (<$IN1>) {
    chomp $item1;    # keep the trailing newline out of the hash key
    my @tmp1 = split( /\t+/, $item1 );
    $lines1{ $tmp1[1] } = \@tmp1;    # save full $item1 line, keyed on $tmp1[1]
}

# step 2, iterate over contents of $IN2 / look up in %lines1 to compare
open my $OUT, '>', "tmp12_01" or die "Cannot open this file: $!";

LOOKUP_AND_COMPARE:
for my $item2 (<$IN2>) {
    #chomp $item2;    # not needed, see last line
    my @tmp2 = split( /\,+/, $item2 );
    # -- look up
    if ( ref( $lines1{ $tmp2[0] } ) eq 'ARRAY' ) {
        my @tmp1 = @{ $lines1{ $tmp2[0] } };    # for clarity, not actually needed; can get value via "$lines1{ $tmp2[0] }->[0]"
        print $OUT $tmp1[0], ",", $item2;       #<-updated to fix bareword from old code
        next LOOKUP_AND_COMPARE;                # move on to the next CSV line
    }
}
#print $OUT "\n";    # probably don't need if you don't "chomp $item2"

Additional optimizations, depending on your constraint (time versus space):

- if time, cache the larger of the 2 files
- if space, cache the smaller of the 2 files

The lesson here, as stated below, is to not nest your loops. It's called "computational complexity". Basically, you only want to have at most 1 level of looping. The line `if ( ref( $lines1{ $tmp2[0] } ) eq 'ARRAY' ) {` is the "constant time" lookup capability provided by the ordered caching of the first file above, and it is how you avoid the inner loop.
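To put rough numbers on the complexity argument, here is a sketch that only counts operations for the two approaches. The sub names and the 1000-element ID lists are made up for illustration; it models the worst case where nothing matches.

```perl
use strict;
use warnings;

# Count string comparisons done by the nested-loop approach.
sub nested_comparisons {
    my ( $keys1, $keys2 ) = @_;
    my $count = 0;
    for my $k1 (@$keys1) {
        for my $k2 (@$keys2) {
            $count++;
            last if $k1 eq $k2;
        }
    }
    return $count;
}

# Count hash operations done by the lookup-table approach:
# one insert per key of the second list, one lookup per key of the first.
sub hashed_operations {
    my ( $keys1, $keys2 ) = @_;
    my %seen;
    my $count = 0;
    for my $k2 (@$keys2) { $seen{$k2} = 1; $count++ }
    for my $k1 (@$keys1) { my $hit = exists $seen{$k1}; $count++ }
    return $count;
}

# Hypothetical worst case: 1000 IDs in each list with no overlap.
my @first_ids  = map { "A$_" } 1 .. 1000;
my @second_ids = map { "B$_" } 1 .. 1000;
printf "nested: %d, hashed: %d\n",
    nested_comparisons( \@first_ids, \@second_ids ),
    hashed_operations( \@first_ids, \@second_ids );
# prints "nested: 1000000, hashed: 2000"
```

The nested version grows with the product of the two file sizes, the hashed version with their sum.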
Re^2: match two files by hippo (Bishop) on Jun 03, 2020 at 13:55 UTC
`print OUT $tmp1[0], ",", $item2;` There is no bareword filehandle `OUT` anywhere else in your code. Perhaps you meant `$OUT`? warnings catches these.
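A small sketch of what that looks like in practice. It writes to an in-memory filehandle and collects warnings through a `$SIG{__WARN__}` handler so they can be shown afterwards; the sample output line is arbitrary.

```perl
use strict;
use warnings;

my @caught;
{
    # Collect run-time warnings instead of letting them go to STDERR.
    local $SIG{__WARN__} = sub { push @caught, $_[0] };

    open my $OUT, '>', \my $buffer or die "Cannot open in-memory file: $!";
    print OUT "A1CF,ENSG00000148584,0,0\n";    # typo: bareword OUT instead of $OUT
    close $OUT;
}
print "warning caught: $_" for @caught;
```

With warnings enabled, the misdirected `print` reports "print() on unopened filehandle OUT" instead of silently writing nothing. Perl may also emit a compile-time "used only once" warning for `OUT`, which goes to STDERR before the handler is installed.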
Re^3: match two files by perlfan (Vicar) on Jun 03, 2020 at 14:09 UTC
Good catch. For OP's benefit, add `use strict; use warnings;`. And fixed the bareword filehandle. Missed that when updating their code. :) ty....
Re^2: match two files by yueli711 (Sexton) on Jun 04, 2020 at 04:57 UTC
Hello perlfan, Thank you so much for your useful code! I already ran `$ sudo cpan Tie::File::AsHash`, but I still got this error. Thank you again and really appreciated!

li@lix:~$ perl match11.pl
Can't locate Tie/Hash/Indexed.pm in @INC (you may need to install the Tie::Hash::Indexed module) (@INC contains: /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.26.1 /usr/local/share/perl/5.26.1 /usr/lib/x86_64-linux-gnu/perl5/5.26 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.26 /usr/share/perl/5.26 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at match11.pl line 4.
BEGIN failed--compilation aborted at match11.pl line 4.
Re^3: match two files by marto (Cardinal) on Jun 04, 2020 at 11:43 UTC
"I already ran `$ sudo cpan Tie::File::AsHash`, but I still got this error."

That module is not used by the code you thanked perlfan for. The error suggests you install Tie::Hash::Indexed, which has many install failures.
Re^4: match two files by yueli711 (Sexton) on Jun 05, 2020 at 02:53 UTC
Re^5: match two files by marto (Cardinal) on Jun 05, 2020 at 06:16 UTC
Re^3: match two files by hippo (Bishop) on Jun 04, 2020 at 09:02 UTC
The error message which you quoted not only tells you what's wrong but even goes so far as to suggest what you may need to do in order to fix it. Did you read it? Did you do what it suggested? What happened then?

