Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

how to avoid full scan in file.

by EBK (Sexton)
on May 24, 2019 at 23:03 UTC ( [id://11100496]=perlquestion: print w/replies, xml ) Need Help??

EBK has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I have this code working almost perfectly. I am reading the file "A" and looking for a match in file "B" and decrement a value from file A. My big problem is that I am doing a full scan in the file "B" and my file "A" has 75k lines and file "B" has 880 lines. Do you have any idea about how to avoid full scan? Example of two files File "A"

l100101,aaaaaaa,a_0100,loc,10,1

l100101,aaaaaaa,a_0100,loc,11,1

l100101,aaaaaaa,a_0100,loc,12,6

File "B"

l103709,bbbbbbb,c_0200,929

l100109,bbbbbbb,b_0100,442

l100107,bbbbbbb,c_0300,389

#!/usr/bin/perl use strict; use warnings; $|=1; my $filea = $ARGV[0]; my $fileb = $ARGV[1]; my $FileC = "result.csv"; open ( FA, '<', $filea) || die ( "File $filea Not Found!" ); open ( FB, '<', $fileb) || die ( "File $fileb Not Found!" ); open ( FC, ">", $FileC) || die ( "File $FileC Not Found!" ); my @B; while ( <FB> ) { chomp; my($look, $sec, $cls, $max) = split ","; push @B, [$look, $sec, $cls, $max]; } my @A; while ( <FA> ) { chomp; my($look, $sec, $cls, $att, $idx, $qtd) = split ","; push @A, [$look, $sec, $cls, $att, $idx, $qtd]; } my $i = 1; my $j = 0; my $k = 0; my $count=0; while ( 1 ) { # -- keep looping til nothing is modified -- my $modified=0; $j = 0; foreach my $row ( @A ) { # -- loop through FileA, $j is rowcount -- $j++; $k=0; # -- loop through FileB, $k is linecount -- foreach my $line ( @B ) { $k++; my $idx1= @$line[0].@$line[1].@$line[2]; my $idx2= @$row[0].@$row[1].@$row[2]; # -- has to match on the index fields -- if ($idx1 eq $idx2) { my $max = @$line[3]; my $tot = @$row[5] -1; last if $count == $max; if ( $tot >= 0 ) { #print "FileA[".$j."]: ".join(",", @$row[0],@$ +row[1],@$row[3],@$row[4],@$row[5] )."\n"; print FC join(",", @$row[0],@$row[1],@$row[3], +@$row[4],$max )."\n"; $count++; @$row[5]=$tot; $modified = 1; } } } } if ((! $modified ) || ($i > 10)) { last; } }

Replies are listed 'Best First'.
Re: how to avoid full scan in file.
by LanX (Saint) on May 24, 2019 at 23:24 UTC
    There are many issues with your code I can't address now.

    But your basic problem is using two loops resulting in 880*75000 comparisons. And not the full scan of files.

    Using lookup hashes %A and %B with would speed this up dramatically.

    Just use $look;$sec;$cls as keys.

    The fastest way are hash slices then:

    @A{keys %B} will give you the intersection.

    Edit

    But maybe easier to understand is to loop over the keys of the smaller hash and to lookup in the bigger one. This will also preserve order.

    Update

    Same principle of you want to avoid holding the bigger file in memory, just read it line per line while comparing against a hash of the smaller one.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      I got but how can I do this because my knowledge in perl is limited. Which direction I have to understand first?
Re: how to avoid full scan in file.
by Cristoforo (Curate) on May 25, 2019 at 01:13 UTC
    I tried to get some of what you need but didn't get the use of $i, $j, $k or $modified.

    This isn't a complete solution, but it uses the suggestion by LanX to create a hash of the smaller B file and loop through the A file just once. This should speed up your program considerably.

    I used 2 pseudo files to stand in for your real files. I hope this gives you some direction for your problem.

    #!/usr/bin/perl use strict; use warnings; $|=1; my $fileA =<<EOF; l100101,aaaaaaa,a_0100,loc,10,1 l100101,aaaaaaa,a_0100,loc,11,1 l100101,aaaaaaa,a_0100,loc,12,6 EOF my $fileB =<<EOF; l103709,bbbbbbb,c_0200,929 l100109,bbbbbbb,b_0100,442 l100107,bbbbbbb,c_0300,389 EOF my $filea = $ARGV[0]; my $fileb = $ARGV[1]; my $FileC = "result.csv"; open ( FA, '<', \$fileA) || die ( "File $filea Not Found!" ); open ( FB, '<', \$fileB) || die ( "File $fileb Not Found!" ); #open ( FC, ">", $FileC) || die ( "File $FileC Not Found!" ); my %B; while ( <FB> ) { chomp; my($look, $sec, $cls, $max) = split ","; $B{"$look,$sec,$cls"} = $max; } my @A; while ( <FA> ) { chomp; my($look, $sec, $cls, $att, $idx, $qtd) = split ","; my $keyA = "$look,$sec,$cls"; if (exists $B{$keyA}) { my $max = $B{$keyA}; my $tot = $qtd - 1; if ($tot >= 0) { print join(",", $look, $sec, $cls, $att, $idx, $max), "\n +"; } } }
      I got this but the result I receive is not the same from the first script. I was analysing this code and I notice I will not cover the all combinations. My first result was 6382 lines and the result of this script was 928. The lines of second result file is in the first result file but it still missing some lines

        Looks like your script creates multiple output records, so if

        l100107,bbbbbbb,c_0300,loc,12,6
        in FileA matches
        l100107,bbbbbbb,c_0300,389
        in FileB, the output is 6 lines (the value of $qtd the last column)
        l100107,bbbbbbb,loc,12,389
        l100107,bbbbbbb,loc,12,389
        l100107,bbbbbbb,loc,12,389
        l100107,bbbbbbb,loc,12,389
        l100107,bbbbbbb,loc,12,389
        l100107,bbbbbbb,loc,12,389
        

        Is that what you want ?

        Also, can you please explain what this code line does.

        last if $count == $max;
        poj
Re: how to avoid full scan in file.
by haukex (Archbishop) on May 25, 2019 at 09:08 UTC

    Thank you for providing sample input, but unfortunately, when I run the code on this sample input, the output is empty. Could you provide sample input that produces some output, and to play it safe also provide that output, each inside <code> tags? See also Short, Self-Contained, Correct Example.

    In general, as has already been suggested, a hash table provides for much faster lookups than a linear scan with nested loops.

    use warnings; use strict; use List::Util qw/ shuffle /; use Time::HiRes qw/ gettimeofday tv_interval /; use Test::More tests=>2; my @looking_for = qw/ foo bar quz baz /; my @looking_in = shuffle qw/ foo bar quz baz / x 100_000, qw/ some other stuff we're not looking for / x 2_000_000; { my $t0 = [gettimeofday]; my $found_count; for my $haystack (@looking_in) { for my $needle (@looking_for) { if ( $needle eq $haystack ) { $found_count++; } } } is $found_count, 400_000, 'linear scan'; diag sprintf "that took %.3fs", tv_interval($t0); } { my $t0 = [gettimeofday]; my $found_count; my %needles_hash = map { ($_=>1) } @looking_for; diag 'needles_hash: ', explain \%needles_hash; for my $haystack (@looking_in) { if ( $needles_hash{$haystack} ) { $found_count++; } } is $found_count, 400_000, 'hash lookup'; diag sprintf "that took %.3fs", tv_interval($t0); } __END__ 1..2 ok 1 - linear scan # that took 2.274s # needles_hash: { # 'bar' => 1, # 'baz' => 1, # 'foo' => 1, # 'quz' => 1 # } ok 2 - hash lookup # that took 0.599s

    When you're asking the question "do the strings in the haystack contain any of the needles", or in general when what you're looking for is not a fixed string but can be expressed as a regex, an alternative is to build a regex.

    use warnings; use strict; use List::Util qw/ shuffle /; use Time::HiRes qw/ gettimeofday tv_interval /; use Test::More tests=>2; my @looking_for = qw/ foo bar quz baz /; my @looking_in = shuffle qw/ xyfooz abcbarx 123quzy abazz / x 10_000, qw/ some other stuff we're not looking for / x 100_000; { my $t0 = [gettimeofday]; my $found_count; for my $haystack (@looking_in) { for my $needle (@looking_for) { if ( $haystack =~ /\Q$needle\E/ ) { $found_count++; } } } is $found_count, 40_000, 'linear scan'; diag sprintf "that took %.3fs", tv_interval($t0); } { my $t0 = [gettimeofday]; my $found_count; my ($needles_regex) = map {qr/$_/} join '|', map {quotemeta} sort { length $b <=> length $a or $a cmp $b } @looking_for; diag "needles_regex: ", explain $needles_regex; for my $haystack (@looking_in) { if ( $haystack =~ $needles_regex ) { $found_count++; } } is $found_count, 40_000, 'regex'; diag sprintf "that took %.3fs", tv_interval($t0); } __END__ 1..2 ok 1 - linear scan # that took 2.716s # needles_regex: qr/bar|baz|foo|quz/ ok 2 - regex # that took 0.212s

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11100496]
Approved by Paladin
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (2)
As of 2024-04-26 06:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found