PerlMonks  

Re: Merging Many Files Side by Side

by gone2015 (Deacon)
on Feb 20, 2009 at 00:37 UTC ( [id://745241]=note )


in reply to Merging Many Files Side by Side

I assume you've tried it and discovered it's too slow?

200 files at 6.9M lines each, each line three tab-delimited fields -- if that's 60 bytes per line, you're reading ~80G bytes and writing ~27G (assuming roughly equal field sizes). On my little Linux box, I just timed cp foo bar for an ~13G file and it took ~15 minutes -- so 80G + 27G looks like about two hours' work? Of course, with faster drives and what-not, you may do better.
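
One way to arrive at the two-hour figure is to scale the cp timing linearly by the total number of bytes moved. The sketch below just redoes that arithmetic; the 60 bytes per line and the equal field sizes are the assumptions stated above, and the linear scaling is my reading of the estimate rather than a measurement.

```perl
use strict;
use warnings;

# Figures from the estimate above; 60 bytes/line is itself an assumption.
my $files          = 200;
my $lines_per_file = 6.9e6;
my $bytes_per_line = 60;
my $fields         = 3;

my $read_bytes  = $files * $lines_per_file * $bytes_per_line;  # every input byte is read
my $write_bytes = $read_bytes / $fields;                       # one field in three is kept

# Reference point: cp of a ~13G file took ~15 minutes.
my $minutes_per_gb = 15 / 13;
my $total_gb       = ($read_bytes + $write_bytes) / 1e9;
my $est_minutes    = $total_gb * $minutes_per_gb;

printf "read  ~%.0f GB\n", $read_bytes / 1e9;
printf "write ~%.0f GB\n", $write_bytes / 1e9;
printf "total ~%.0f GB, roughly %.1f hours at the cp rate\n",
    $total_gb, $est_minutes / 60;
```

Changing $bytes_per_line or the cp reference point shows how sensitive the estimate is to those guesses.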

I've never tried opening 200 files at once...
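
For what it's worth, here is a minimal sketch of opening them all up front, so that a failure part-way through says how far it got. open_all is a hypothetical helper name and taking the file names from @ARGV is an assumption; the per-process limit on open file descriptors (ulimit -n on Linux) is the thing to check if it dies.

```perl
use strict;
use warnings;

# Open every named file for reading and return the handles in the same order.
sub open_all {
    my @files = @_;
    my @handles;
    foreach my $file (@files) {
        open(my $fh, '<', $file)
            or die "Cannot open '$file' (", scalar(@handles), " already open): $!\n";
        push @handles, $fh;
    }
    return @handles;
}

# Hypothetical usage: input file names given on the command line.
my @handles = open_all(@ARGV);
printf STDERR "opened %d files\n", scalar @handles;
```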

It's hard to say what the processing time will add to this. It looks pretty I/O bound. If processing time is an issue, then I'd look at reading the files a chunk at a time... but I'd have to be convinced there was a problem; and even then I'm not sure that processing chunks of the files in Perl would be quicker than using Perl to read a line at a time.
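
To make "a chunk at a time" concrete, here is a rough sketch of a reader that pulls large blocks with read and hands lines back one by one. chunked_line_reader and its $chunk_size parameter are hypothetical names, and nothing here shows it is faster than <$fh>, which is already buffered by Perl; it would need timing against the plain loop.

```perl
use strict;
use warnings;

# Returns an iterator that yields one line per call (without the newline),
# or undef at end of file. $chunk_size is a tuning knob, default 1 MiB.
sub chunked_line_reader {
    my ($fh, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;
    my $buf = '';     # unconsumed tail of the last chunk
    my @lines;        # complete lines waiting to be handed out
    my $eof = 0;
    return sub {
        while (!@lines && !$eof) {
            # Append the next chunk after whatever partial line is left over.
            my $n = read($fh, $buf, $chunk_size, length $buf);
            die "read failed: $!" unless defined $n;
            if ($n == 0) {
                $eof = 1;
                push @lines, $buf if length $buf;   # last line had no "\n"
                $buf = '';
            }
            else {
                @lines = split /\n/, $buf, -1;      # -1 keeps trailing empty field
                $buf   = pop @lines;                # possibly partial last line
            }
        }
        return shift @lines;
    };
}
```

Each input handle would get its own iterator, and the inner loop would call it in place of <$op>.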

So... what are you expecting, and what do you get when you try the straightforward approach?

If you want to squeeze as much as possible out of the inner loop, I think the following may be faster:

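    my $atleastone = 1;
    while ($atleastone) {
      $atleastone = 0 ;
      my @l = () ;
      foreach my $op (@handles) {
        my $c2 ;
        # Count the handle and pick out column 2 if it still had a line
        ++$atleastone and (undef, $c2) = split if defined($_ = <$op>) ;
        push @l, $c2 || "0" ;
      } ;
      # NB: the final pass, with every handle at EOF, still prints a line of "0"s
      print OUTFILE join("\t", @l), "\n" ;
    } ;
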
or possibly:

    my $atleastone = 1;
    while ($atleastone) {
      $atleastone = 0 ;
      my $l = '' ;
      foreach my $op (@handles) {
        my $c2 ;
        ++$atleastone and (undef, $c2) = split if defined($_ = <$op>) ;
        $l .= ($c2 || "0") . "\t" ;
      } ;
      chop $l ;      # Discard trailing "\t"
      print OUTFILE "$l\n" ;
    } ;

You seem concerned that column 2 might be empty, or that there may be blank or whitespace only lines in the input files. If you could guarantee that no line would be empty and no line would be whitespace only, then:

    my $atleastone = 1;
    while (defined($atleastone)) {
      $atleastone = undef ;
      my $l = '' ;
      foreach my $op (@handles) {
        my $c2 ;
        ($atleastone, $c2) = split if defined($_ = <$op>) ;
        $l .= ($c2 || "0") . "\t" ;
      } ;
      chop $l ;      # Discard trailing "\t"
      print OUTFILE "$l\n" ;
    } ;

squeezes out a tiny fraction! (It relies on $atleastone being set to something defined (even 0 or "") whenever <$op> returns a line.)
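
Whether any of this is measurable is easy enough to check with the core Benchmark module. The sketch below compares only the two ways of building the output line (array plus join, versus string append plus chop) on made-up data, with no file I/O at all; build_join, build_concat and the sample @fields are hypothetical, and results will differ from box to box.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample data standing in for one output row of 200 columns,
# with every seventh column empty.
my @fields = map { $_ % 7 ? "v$_" : '' } 1 .. 200;

# Variant 1: collect into an array, then join.
sub build_join {
    my @l;
    push @l, $_ || "0" for @fields;
    return join("\t", @l) . "\n";
}

# Variant 2: append to a string, then chop the trailing tab.
sub build_concat {
    my $l = '';
    $l .= ($_ || "0") . "\t" for @fields;
    chop $l;
    return "$l\n";
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, { join => \&build_join, concat => \&build_concat });
```

If the difference is down in the noise here, it is unlikely to show up at all next to the disk time.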
