Re^4: how to avoid full scan in file.

Here the exactly example. I match the file A with the file B through the keys "l100107,bbbbbb,a_0100" so I decrement the $qtd from file A in this example 16 and 24 till 0. Notice that if you sum up this two values they are the $max of file B 40. There is no possibility that my process result different values from both files. I put the $tot in the result file to explain the flow.

File A
l100107,bbbbbb,a_0100,loc,13,16
l100107,bbbbbb,a_0100,loc,14,24

File B
l100107,bbbbbb,a_0100,40

Result File
l100107,bbbbbb,loc,13,40,15
l100107,bbbbbb,loc,14,40,23
l100107,bbbbbb,loc,13,40,14
l100107,bbbbbb,loc,14,40,22
l100107,bbbbbb,loc,13,40,13
l100107,bbbbbb,loc,14,40,21
l100107,bbbbbb,loc,13,40,12
l100107,bbbbbb,loc,14,40,20
l100107,bbbbbb,loc,13,40,11
l100107,bbbbbb,loc,14,40,19
l100107,bbbbbb,loc,13,40,10
l100107,bbbbbb,loc,14,40,18
l100107,bbbbbb,loc,13,40,9
l100107,bbbbbb,loc,14,40,17
l100107,bbbbbb,loc,13,40,8
l100107,bbbbbb,loc,14,40,16
l100107,bbbbbb,loc,13,40,7
l100107,bbbbbb,loc,14,40,15
l100107,bbbbbb,loc,13,40,6
l100107,bbbbbb,loc,14,40,14
l100107,bbbbbb,loc,13,40,5
l100107,bbbbbb,loc,14,40,13
l100107,bbbbbb,loc,13,40,4
l100107,bbbbbb,loc,14,40,12
l100107,bbbbbb,loc,13,40,3
l100107,bbbbbb,loc,14,40,11
l100107,bbbbbb,loc,13,40,2
l100107,bbbbbb,loc,14,40,10
l100107,bbbbbb,loc,13,40,1
l100107,bbbbbb,loc,14,40,9
l100107,bbbbbb,loc,13,40,0
l100107,bbbbbb,loc,14,40,8
l100107,bbbbbb,loc,14,40,7
l100107,bbbbbb,loc,14,40,6
l100107,bbbbbb,loc,14,40,5
l100107,bbbbbb,loc,14,40,4
l100107,bbbbbb,loc,14,40,3
l100107,bbbbbb,loc,14,40,2
l100107,bbbbbb,loc,14,40,1
l100107,bbbbbb,loc,14,40,0
[download]

Comment on Re^4: how to avoid full scan in file. Download Code

Replies are listed 'Best First'.
Re^5: how to avoid full scan in file. by poj (Abbot) on May 25, 2019 at 17:15 UTC
Is it OK if the records are sorted like this ? l100107,bbbbbb,loc,13,40,15 l100107,bbbbbb,loc,13,40,14 .. etc l100107,bbbbbb,loc,13,40,2 l100107,bbbbbb,loc,13,40,1 l100107,bbbbbb,loc,13,40,0 l100107,bbbbbb,loc,14,40,23 l100107,bbbbbb,loc,14,40,22 .. etc l100107,bbbbbb,loc,14,40,2 l100107,bbbbbb,loc,14,40,1 l100107,bbbbbb,loc,14,40,0 poj	[reply]
Re^6.1: how to avoid full scan in file. by EBK (Sexton) on May 25, 2019 at 18:02 UTC
At this point in my process, I can not sort the $ idx column. This distribution I am using is similar to a distribution of playing cards. And I remove 1 item from each loop of each $ idx until it reaches 0.	[reply]
Re: Re^6.1: how to avoid full scan in file. by poj (Abbot) on May 25, 2019 at 18:58 UTC
Use hashes to match records and store them in an array that you can iterate over multiple times to create the output in the original order. The input files are only read once. #!/usr/bin/perl use strict; use warnings; # config my $FileA = $ARGV[0] \|\| 'FileA.txt'; my $FileB = $ARGV[1] \|\| 'FileB.txt'; my $FileC = "result.csv"; # read in smaller file key1,key2,key3,count # example l103709,bbbbbbb,c_0200,929 # and create lookup hash_B my %hash_B = (); open FB, '<', $FileB or die "File $FileB Not Found!"; my $countB = 0; while ( <FB> ) { next unless /\S/; # skip blank lines chomp; my ($look, $sec, $cls, $max) = split ","; my $key = join ',',$look, $sec, $cls; if ( exists $hash_B{$key} ){ warn "WARNING : Duplicate key [$key] in $FileB at line $.\n"; } else { $hash_B{$key} = $max; ++$countB; } } close FB; printf "%d lines read from %s\n",$countB,$FileB; # scan larger FileA to match with smaller FileB # example l103709,bbbbbbb,c_0200,loc,10,1 open FA, '<', $FileA or die "File $FileA Not Found!"; my $countA = 0; my $sum_qtd = 0; my %hash_A = (); my @match = (); while ( <FA> ) { next unless /\S/; # skip blank lines chomp; my ($look, $sec, $cls, $att, $idx, $qtd) = split ","; my $key = join ',',$look, $sec, $cls; if (exists $hash_B{$key}){ # match found my $max = $hash_B{$key}; $hash_A{$key} += $qtd; $sum_qtd += $qtd; # use to check output my $record = join ',',$look,$sec,$att,$idx,$max; push @match,[$record,$qtd]; # store for output } ++$countA; } close FA; printf "%d lines read from %s\n",$countA,$FileA; printf "%d matches with %s\n",scalar @match,$FileB; # check total of qtd in FileA # equals max in FileB for each matched key my $error = 0; for my $key (keys %hash_A){ if ($hash_A{$key} != $hash_B{$key}){ ++$error; printf "WARNING $key fileA = %d ; fileB = %d\n",$hash_A{$key},$has +h_B{$key}; } } if ($error== 0){ print "OK - no errors\n"; } else { printf "ERROR - see %d warnings\n",$error; } # output records printf "%d lines expected\n",$sum_qtd; open FC, '>', $FileC or die "File $FileC Not Found!"; my $count_zero = 0; my $countC = 0; while ($count_zero < @match ){ for (@match){ my ($record,$qtd) = @$_; if ($qtd > 0){ --$qtd; ++$count_zero if $qtd == 0; $_->[1] = $qtd ; # update qtd in array print FC join ',',$record,$qtd."\n"; ++$countC; } } } close FC; printf "%d lines written to %s\n",$countC,$FileC; [download] poj	[reply] [d/l]
Re^2: Re^6.1: how to avoid full scan in file. by EBK (Sexton) on May 25, 2019 at 23:38 UTC
Re^3: Re^6.1: how to avoid full scan in file. by poj (Abbot) on May 26, 2019 at 07:53 UTC


Keep It Simple, Stupid
	PerlMonks