how to avoid full scan in file.

EBK has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I have this code working almost perfectly. I am reading the file "A" and looking for a match in file "B" and decrement a value from file A. My big problem is that I am doing a full scan in the file "B" and my file "A" has 75k lines and file "B" has 880 lines. Do you have any idea about how to avoid full scan? Example of two files File "A"

l100101,aaaaaaa,a_0100,loc,10,1

l100101,aaaaaaa,a_0100,loc,11,1

l100101,aaaaaaa,a_0100,loc,12,6

File "B"

l103709,bbbbbbb,c_0200,929

l100109,bbbbbbb,b_0100,442

l100107,bbbbbbb,c_0300,389

#!/usr/bin/perl

use strict;
use warnings;

$|=1;

my $filea = $ARGV[0];
my $fileb = $ARGV[1];
my $FileC = "result.csv";


open ( FA, '<', $filea) || die ( "File $filea Not Found!" );
open ( FB, '<', $fileb) || die ( "File $fileb Not Found!" );
open ( FC, ">", $FileC) || die ( "File $FileC Not Found!" );

my @B;
while ( <FB> ) {
    chomp;
    my($look, $sec, $cls, $max) = split ",";
    push @B, [$look, $sec, $cls, $max];
}

my @A;
while ( <FA> ) {
    chomp;
    my($look, $sec, $cls, $att, $idx, $qtd) = split ",";
    push @A, [$look, $sec, $cls, $att, $idx, $qtd];
}

my $i = 1;
my $j = 0;
my $k = 0;
my $count=0;
while ( 1 ) {
    # -- keep looping til nothing is modified --
    my $modified=0;
    $j = 0;
    foreach my $row ( @A ) {
        # -- loop through FileA, $j is rowcount --
        $j++;
        $k=0;
        # -- loop through FileB, $k is linecount --
        foreach my $line ( @B ) {
            $k++;
            my $idx1= @$line[0].@$line[1].@$line[2];
            my $idx2= @$row[0].@$row[1].@$row[2];
            # -- has to match on the index fields --
            if ($idx1 eq $idx2) {
                my $max = @$line[3];
                my $tot = @$row[5] -1;
                last if $count == $max;
                if ( $tot >= 0 ) {
                        #print "FileA[".$j."]: ".join(",", @$row[0],@$
+row[1],@$row[3],@$row[4],@$row[5]  )."\n";
                        print FC join(",", @$row[0],@$row[1],@$row[3],
+@$row[4],$max  )."\n";
                        $count++;

                    @$row[5]=$tot;
                    $modified = 1;
                }
             }
        }
    }
    if ((! $modified ) || ($i > 10)) {
        last;
    }
}
[download]

Comment on how to avoid full scan in file. Download Code

Replies are listed 'Best First'.
Re: how to avoid full scan in file. by LanX (Saint) on May 24, 2019 at 23:24 UTC
There are many issues with your code I can't address now. But your basic problem is using two loops resulting in 880*75000 comparisons. And not the full scan of files. Using lookup hashes %A and %B with would speed this up dramatically. Just use `$look;$sec;$cls` as keys. The fastest way are hash slices then: `@A{keys %B}` will give you the intersection. Edit But maybe easier to understand is to loop over the keys of the smaller hash and to lookup in the bigger one. This will also preserve order. Update Same principle of you want to avoid holding the bigger file in memory, just read it line per line while comparing against a hash of the smaller one. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery FootballPerl is like chess, only without the dice}	[reply] [d/l] [select]
Re^2: how to avoid full scan in file. by EBK (Sexton) on May 25, 2019 at 00:01 UTC
I got but how can I do this because my knowledge in perl is limited. Which direction I have to understand first?	[reply]
Re: how to avoid full scan in file. by Cristoforo (Curate) on May 25, 2019 at 01:13 UTC
I tried to get some of what you need but didn't get the use of `$i, $j, $k or $modified`. This isn't a complete solution, but it uses the suggestion by LanX to create a hash of the smaller B file and loop through the A file just once. This should speed up your program considerably. I used 2 pseudo files to stand in for your real files. I hope this gives you some direction for your problem. #!/usr/bin/perl use strict; use warnings; $\|=1; my $fileA =<<EOF; l100101,aaaaaaa,a_0100,loc,10,1 l100101,aaaaaaa,a_0100,loc,11,1 l100101,aaaaaaa,a_0100,loc,12,6 EOF my $fileB =<<EOF; l103709,bbbbbbb,c_0200,929 l100109,bbbbbbb,b_0100,442 l100107,bbbbbbb,c_0300,389 EOF my $filea = $ARGV[0]; my $fileb = $ARGV[1]; my $FileC = "result.csv"; open ( FA, '<', \$fileA) \|\| die ( "File $filea Not Found!" ); open ( FB, '<', \$fileB) \|\| die ( "File $fileb Not Found!" ); #open ( FC, ">", $FileC) \|\| die ( "File $FileC Not Found!" ); my %B; while ( <FB> ) { chomp; my($look, $sec, $cls, $max) = split ","; $B{"$look,$sec,$cls"} = $max; } my @A; while ( <FA> ) { chomp; my($look, $sec, $cls, $att, $idx, $qtd) = split ","; my $keyA = "$look,$sec,$cls"; if (exists $B{$keyA}) { my $max = $B{$keyA}; my $tot = $qtd - 1; if ($tot >= 0) { print join(",", $look, $sec, $cls, $att, $idx, $max), "\n +"; } } } [download]	[reply] [d/l] [select]
Re^2: how to avoid full scan in file. by EBK (Sexton) on May 25, 2019 at 03:54 UTC
I got this but the result I receive is not the same from the first script. I was analysing this code and I notice I will not cover the all combinations. My first result was 6382 lines and the result of this script was 928. The lines of second result file is in the first result file but it still missing some lines	[reply]
Re^3: how to avoid full scan in file. by poj (Abbot) on May 25, 2019 at 14:38 UTC
Looks like your script creates multiple output records, so if l100107,bbbbbbb,c_0300,loc,12,6 in FileA matches l100107,bbbbbbb,c_0300,389 in FileB, the output is 6 lines (the value of $qtd the last column) l100107,bbbbbbb,loc,12,389 l100107,bbbbbbb,loc,12,389 l100107,bbbbbbb,loc,12,389 l100107,bbbbbbb,loc,12,389 l100107,bbbbbbb,loc,12,389 l100107,bbbbbbb,loc,12,389 Is that what you want ? Also, can you please explain what this code line does. `last if $count == $max;` [download] poj	[reply] [d/l]
Re^4: how to avoid full scan in file. by EBK (Sexton) on May 25, 2019 at 16:43 UTC
Re^5: how to avoid full scan in file. by poj (Abbot) on May 25, 2019 at 17:15 UTC
Some notes below your chosen depth have not been shown here
Re^6: how to avoid full scan in file. by EBK (Sexton) on May 25, 2019 at 18:00 UTC
Re: how to avoid full scan in file. by haukex (Archbishop) on May 25, 2019 at 09:08 UTC
Thank you for providing sample input, but unfortunately, when I run the code on this sample input, the output is empty. Could you provide sample input that produces some output, and to play it safe also provide that output, each inside `<code>` tags? See also Short, Self-Contained, Correct Example. In general, as has already been suggested, a hash table provides for much faster lookups than a linear scan with nested loops. use warnings; use strict; use List::Util qw/ shuffle /; use Time::HiRes qw/ gettimeofday tv_interval /; use Test::More tests=>2; my @looking_for = qw/ foo bar quz baz /; my @looking_in = shuffle qw/ foo bar quz baz / x 100_000, qw/ some other stuff we're not looking for / x 2_000_000; { my $t0 = [gettimeofday]; my $found_count; for my $haystack (@looking_in) { for my $needle (@looking_for) { if ( $needle eq $haystack ) { $found_count++; } } } is $found_count, 400_000, 'linear scan'; diag sprintf "that took %.3fs", tv_interval($t0); } { my $t0 = [gettimeofday]; my $found_count; my %needles_hash = map { ($_=>1) } @looking_for; diag 'needles_hash: ', explain \%needles_hash; for my $haystack (@looking_in) { if ( $needles_hash{$haystack} ) { $found_count++; } } is $found_count, 400_000, 'hash lookup'; diag sprintf "that took %.3fs", tv_interval($t0); } __END__ 1..2 ok 1 - linear scan # that took 2.274s # needles_hash: { # 'bar' => 1, # 'baz' => 1, # 'foo' => 1, # 'quz' => 1 # } ok 2 - hash lookup # that took 0.599s [download] When you're asking the question "do the strings in the haystack contain any of the needles", or in general when what you're looking for is not a fixed string but can be expressed as a regex, an alternative is to build a regex. use warnings; use strict; use List::Util qw/ shuffle /; use Time::HiRes qw/ gettimeofday tv_interval /; use Test::More tests=>2; my @looking_for = qw/ foo bar quz baz /; my @looking_in = shuffle qw/ xyfooz abcbarx 123quzy abazz / x 10_000, qw/ some other stuff we're not looking for / x 100_000; { my $t0 = [gettimeofday]; my $found_count; for my $haystack (@looking_in) { for my $needle (@looking_for) { if ( $haystack =~ /\Q$needle\E/ ) { $found_count++; } } } is $found_count, 40_000, 'linear scan'; diag sprintf "that took %.3fs", tv_interval($t0); } { my $t0 = [gettimeofday]; my $found_count; my ($needles_regex) = map {qr/$_/} join '\|', map {quotemeta} sort { length $b <=> length $a or $a cmp $b } @looking_for; diag "needles_regex: ", explain $needles_regex; for my $haystack (@looking_in) { if ( $haystack =~ $needles_regex ) { $found_count++; } } is $found_count, 40_000, 'regex'; diag sprintf "that took %.3fs", tv_interval($t0); } __END__ 1..2 ok 1 - linear scan # that took 2.716s # needles_regex: qr/bar\|baz\|foo\|quz/ ok 2 - regex # that took 0.212s [download]	[reply] [d/l] [select]


P is for Practical
	PerlMonks

how to avoid full scan in file.

Edit

Update