Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Re^4: how to avoid full scan in file.

by EBK (Sexton)
on May 25, 2019 at 16:43 UTC ( [id://11100521]=note: print w/replies, xml ) Need Help??


in reply to Re^3: how to avoid full scan in file.
in thread how to avoid full scan in file.

Here the exactly example. I match the file A with the file B through the keys "l100107,bbbbbb,a_0100" so I decrement the $qtd from file A in this example 16 and 24 till 0. Notice that if you sum up this two values they are the $max of file B 40. There is no possibility that my process result different values from both files. I put the $tot in the result file to explain the flow.
File A l100107,bbbbbb,a_0100,loc,13,16 l100107,bbbbbb,a_0100,loc,14,24 File B l100107,bbbbbb,a_0100,40 Result File l100107,bbbbbb,loc,13,40,15 l100107,bbbbbb,loc,14,40,23 l100107,bbbbbb,loc,13,40,14 l100107,bbbbbb,loc,14,40,22 l100107,bbbbbb,loc,13,40,13 l100107,bbbbbb,loc,14,40,21 l100107,bbbbbb,loc,13,40,12 l100107,bbbbbb,loc,14,40,20 l100107,bbbbbb,loc,13,40,11 l100107,bbbbbb,loc,14,40,19 l100107,bbbbbb,loc,13,40,10 l100107,bbbbbb,loc,14,40,18 l100107,bbbbbb,loc,13,40,9 l100107,bbbbbb,loc,14,40,17 l100107,bbbbbb,loc,13,40,8 l100107,bbbbbb,loc,14,40,16 l100107,bbbbbb,loc,13,40,7 l100107,bbbbbb,loc,14,40,15 l100107,bbbbbb,loc,13,40,6 l100107,bbbbbb,loc,14,40,14 l100107,bbbbbb,loc,13,40,5 l100107,bbbbbb,loc,14,40,13 l100107,bbbbbb,loc,13,40,4 l100107,bbbbbb,loc,14,40,12 l100107,bbbbbb,loc,13,40,3 l100107,bbbbbb,loc,14,40,11 l100107,bbbbbb,loc,13,40,2 l100107,bbbbbb,loc,14,40,10 l100107,bbbbbb,loc,13,40,1 l100107,bbbbbb,loc,14,40,9 l100107,bbbbbb,loc,13,40,0 l100107,bbbbbb,loc,14,40,8 l100107,bbbbbb,loc,14,40,7 l100107,bbbbbb,loc,14,40,6 l100107,bbbbbb,loc,14,40,5 l100107,bbbbbb,loc,14,40,4 l100107,bbbbbb,loc,14,40,3 l100107,bbbbbb,loc,14,40,2 l100107,bbbbbb,loc,14,40,1 l100107,bbbbbb,loc,14,40,0

Replies are listed 'Best First'.
Re^5: how to avoid full scan in file.
by poj (Abbot) on May 25, 2019 at 17:15 UTC

    Is it OK if the records are sorted like this ?

    l100107,bbbbbb,loc,13,40,15
    l100107,bbbbbb,loc,13,40,14
    .. etc
    l100107,bbbbbb,loc,13,40,2
    l100107,bbbbbb,loc,13,40,1
    l100107,bbbbbb,loc,13,40,0
    l100107,bbbbbb,loc,14,40,23
    l100107,bbbbbb,loc,14,40,22
    .. etc
    l100107,bbbbbb,loc,14,40,2
    l100107,bbbbbb,loc,14,40,1
    l100107,bbbbbb,loc,14,40,0
    
    poj
      At this point in my process, I can not sort the $ idx column. This distribution I am using is similar to a distribution of playing cards. And I remove 1 item from each loop of each $ idx until it reaches 0.

        Use hashes to match records and store them in an array that you can iterate over multiple times to create the output in the original order. The input files are only read once.

        #!/usr/bin/perl use strict; use warnings; # config my $FileA = $ARGV[0] || 'FileA.txt'; my $FileB = $ARGV[1] || 'FileB.txt'; my $FileC = "result.csv"; # read in smaller file key1,key2,key3,count # example l103709,bbbbbbb,c_0200,929 # and create lookup hash_B my %hash_B = (); open FB, '<', $FileB or die "File $FileB Not Found!"; my $countB = 0; while ( <FB> ) { next unless /\S/; # skip blank lines chomp; my ($look, $sec, $cls, $max) = split ","; my $key = join ',',$look, $sec, $cls; if ( exists $hash_B{$key} ){ warn "WARNING : Duplicate key [$key] in $FileB at line $.\n"; } else { $hash_B{$key} = $max; ++$countB; } } close FB; printf "%d lines read from %s\n",$countB,$FileB; # scan larger FileA to match with smaller FileB # example l103709,bbbbbbb,c_0200,loc,10,1 open FA, '<', $FileA or die "File $FileA Not Found!"; my $countA = 0; my $sum_qtd = 0; my %hash_A = (); my @match = (); while ( <FA> ) { next unless /\S/; # skip blank lines chomp; my ($look, $sec, $cls, $att, $idx, $qtd) = split ","; my $key = join ',',$look, $sec, $cls; if (exists $hash_B{$key}){ # match found my $max = $hash_B{$key}; $hash_A{$key} += $qtd; $sum_qtd += $qtd; # use to check output my $record = join ',',$look,$sec,$att,$idx,$max; push @match,[$record,$qtd]; # store for output } ++$countA; } close FA; printf "%d lines read from %s\n",$countA,$FileA; printf "%d matches with %s\n",scalar @match,$FileB; # check total of qtd in FileA # equals max in FileB for each matched key my $error = 0; for my $key (keys %hash_A){ if ($hash_A{$key} != $hash_B{$key}){ ++$error; printf "WARNING $key fileA = %d ; fileB = %d\n",$hash_A{$key},$has +h_B{$key}; } } if ($error== 0){ print "OK - no errors\n"; } else { printf "ERROR - see %d warnings\n",$error; } # output records printf "%d lines expected\n",$sum_qtd; open FC, '>', $FileC or die "File $FileC Not Found!"; my $count_zero = 0; my $countC = 0; while ($count_zero < @match ){ for (@match){ my ($record,$qtd) = @$_; if ($qtd > 0){ --$qtd; ++$count_zero if $qtd == 0; $_->[1] = $qtd ; # update qtd in array print FC join ',',$record,$qtd."\n"; ++$countC; } } } close FC; printf "%d lines written to %s\n",$countC,$FileC;
        poj

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11100521]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (4)
As of 2024-04-25 06:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found