
Re: how to avoid full scan in file.

by Cristoforo (Curate)
on May 25, 2019 at 01:13 UTC


in reply to how to avoid full scan in file.

I tried to cover some of what you need, but I didn't understand the purpose of $i, $j, $k, or $modified.

This isn't a complete solution, but it uses the suggestion by LanX: build a hash from the smaller B file, then loop through the A file just once. Each line of A then costs one hash lookup instead of another pass over B, which should speed up your program considerably.

I used two in-memory pseudo-files to stand in for your real files. I hope this gives you some direction for your problem.

#!/usr/bin/perl
use strict;
use warnings;
$|=1;

# In-memory stand-ins for the real files
my $fileA =<<EOF;
l100101,aaaaaaa,a_0100,loc,10,1
l100101,aaaaaaa,a_0100,loc,11,1
l100101,aaaaaaa,a_0100,loc,12,6
EOF

my $fileB =<<EOF;
l103709,bbbbbbb,c_0200,929
l100109,bbbbbbb,b_0100,442
l100107,bbbbbbb,c_0300,389
EOF

my $filea = $ARGV[0];
my $fileb = $ARGV[1];
my $FileC = "result.csv";

open ( FA, '<', \$fileA) || die ( "File $filea Not Found!" );
open ( FB, '<', \$fileB) || die ( "File $fileb Not Found!" );
#open ( FC, ">", $FileC) || die ( "File $FileC Not Found!" );

# Build the lookup from the smaller file: "look,sec,cls" => max
my %B;
while ( <FB> ) {
    chomp;
    my($look, $sec, $cls, $max) = split ",";
    $B{"$look,$sec,$cls"} = $max;
}

# Single pass over file A, one hash lookup per line
my @A;
while ( <FA> ) {
    chomp;
    my($look, $sec, $cls, $att, $idx, $qtd) = split ",";
    my $keyA = "$look,$sec,$cls";
    if (exists $B{$keyA}) {
        my $max = $B{$keyA};
        my $tot = $qtd - 1;
        if ($tot >= 0) {
            print join(",", $look, $sec, $cls, $att, $idx, $max), "\n";
        }
    }
}
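
For real files, the same lookup might be arranged roughly as follows. This is only a sketch under assumptions: the sub names load_b_index and match_a_rows are made up for illustration, the two paths are assumed to arrive as the first and second command-line arguments, and both files are assumed to be plain comma-separated text with no quoted fields.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build "look,sec,cls" => max from the smaller file (B).
sub load_b_index {
    my ($fh) = @_;
    my %max_for;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        my ( $look, $sec, $cls, $max ) = split /,/, $line;
        $max_for{"$look,$sec,$cls"} = $max;
    }
    return \%max_for;
}

# Hypothetical helper: read file A once and return the rows whose key is in B.
sub match_a_rows {
    my ( $fh, $max_for ) = @_;
    my @rows;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        my ( $look, $sec, $cls, $att, $idx, $qtd ) = split /,/, $line;
        my $key = "$look,$sec,$cls";
        next unless exists $max_for->{$key};
        next if $qtd < 1;    # same filter as the $tot >= 0 check
        push @rows, join( ',', $look, $sec, $cls, $att, $idx, $max_for->{$key} );
    }
    return @rows;
}

# Assumed usage: perl script.pl fileA.csv fileB.csv
if ( @ARGV == 2 ) {
    my ( $path_a, $path_b ) = @ARGV;
    open my $fb, '<', $path_b or die "Cannot open $path_b: $!";
    my $max_for = load_b_index($fb);
    close $fb;
    open my $fa, '<', $path_a or die "Cannot open $path_a: $!";
    print "$_\n" for match_a_rows( $fa, $max_for );
    close $fa;
}
```

Lexical filehandles and the three-argument open keep the handles scoped, and $! in the die message reports why an open failed rather than assuming the file was not found.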

Re^2: how to avoid full scan in file.
by EBK (Sexton) on May 25, 2019 at 03:54 UTC
    I ran this, but the result is not the same as the one from the first script. Analysing the code, I noticed that it does not cover all the combinations. My first result had 6382 lines and this script's result has 928. Every line of the second result file is in the first result file, but some lines are still missing.

      Looks like your script creates multiple output records, so if

      l100107,bbbbbbb,c_0300,loc,12,6
      in FileA matches
      l100107,bbbbbbb,c_0300,389
      in FileB, the output is 6 lines (the value of $qtd, the last column)
      l100107,bbbbbbb,loc,12,389
      l100107,bbbbbbb,loc,12,389
      l100107,bbbbbbb,loc,12,389
      l100107,bbbbbbb,loc,12,389
      l100107,bbbbbbb,loc,12,389
      l100107,bbbbbbb,loc,12,389
      

      Is that what you want ?

      Also, can you please explain what this line of code does?

      last if $count == $max;
      poj
        Here is the exact example. I match file A with file B through the key "l100107,bbbbbb,a_0100", then decrement each $qtd from file A (16 and 24 in this example) until it reaches 0. Notice that these two values sum to the $max from file B, 40. It is not possible for my process to produce different values from the two files. I put $tot in the result file to show the flow.
        File A
        l100107,bbbbbb,a_0100,loc,13,16
        l100107,bbbbbb,a_0100,loc,14,24

        File B
        l100107,bbbbbb,a_0100,40

        Result File
        l100107,bbbbbb,loc,13,40,15
        l100107,bbbbbb,loc,14,40,23
        l100107,bbbbbb,loc,13,40,14
        l100107,bbbbbb,loc,14,40,22
        l100107,bbbbbb,loc,13,40,13
        l100107,bbbbbb,loc,14,40,21
        l100107,bbbbbb,loc,13,40,12
        l100107,bbbbbb,loc,14,40,20
        l100107,bbbbbb,loc,13,40,11
        l100107,bbbbbb,loc,14,40,19
        l100107,bbbbbb,loc,13,40,10
        l100107,bbbbbb,loc,14,40,18
        l100107,bbbbbb,loc,13,40,9
        l100107,bbbbbb,loc,14,40,17
        l100107,bbbbbb,loc,13,40,8
        l100107,bbbbbb,loc,14,40,16
        l100107,bbbbbb,loc,13,40,7
        l100107,bbbbbb,loc,14,40,15
        l100107,bbbbbb,loc,13,40,6
        l100107,bbbbbb,loc,14,40,14
        l100107,bbbbbb,loc,13,40,5
        l100107,bbbbbb,loc,14,40,13
        l100107,bbbbbb,loc,13,40,4
        l100107,bbbbbb,loc,14,40,12
        l100107,bbbbbb,loc,13,40,3
        l100107,bbbbbb,loc,14,40,11
        l100107,bbbbbb,loc,13,40,2
        l100107,bbbbbb,loc,14,40,10
        l100107,bbbbbb,loc,13,40,1
        l100107,bbbbbb,loc,14,40,9
        l100107,bbbbbb,loc,13,40,0
        l100107,bbbbbb,loc,14,40,8
        l100107,bbbbbb,loc,14,40,7
        l100107,bbbbbb,loc,14,40,6
        l100107,bbbbbb,loc,14,40,5
        l100107,bbbbbb,loc,14,40,4
        l100107,bbbbbb,loc,14,40,3
        l100107,bbbbbb,loc,14,40,2
        l100107,bbbbbb,loc,14,40,1
        l100107,bbbbbb,loc,14,40,0
        At this point in my process, I cannot sort the $idx column. The distribution I am using is similar to dealing playing cards: each pass of the loop removes 1 item from each $idx until it reaches 0.
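
The card-dealing distribution described above could be sketched like this. It is not the original script, which is not shown in this thread. The sub name deal_round_robin and the [ idx, qtd ] pair layout are assumptions made for illustration, and the sketch assumes all rows passed in share one key and arrive in file order.

```perl
use strict;
use warnings;

# Hypothetical helper: take one unit from each idx in turn until every
# count reaches 0.
# $rows: array ref of [ $idx, $qtd ] pairs for a single key, in file order.
# Returns [ $idx, $remaining ] pairs in the order the units are dealt.
sub deal_round_robin {
    my ($rows) = @_;
    my @left = map { [ @$_ ] } @$rows;    # copy so the caller's data is untouched
    my @dealt;
    while ( grep { $_->[1] > 0 } @left ) {
        for my $row (@left) {
            next if $row->[1] <= 0;       # this idx is already used up
            $row->[1]--;
            push @dealt, [ $row->[0], $row->[1] ];
        }
    }
    return @dealt;
}

# The two file A rows from the example (idx 13 with 16, idx 14 with 24)
my @dealt = deal_round_robin( [ [ 13, 16 ], [ 14, 24 ] ] );
my $max   = 40;    # the file B value for this key
print join( ',', 'l100107', 'bbbbbb', 'loc', $_->[0], $max, $_->[1] ), "\n"
    for @dealt;
```

With those two rows the sketch emits 40 lines, alternating between idx 13 and idx 14 until idx 13 reaches 0 and then continuing with idx 14 alone, which is the order shown in the Result File listing.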
