Re: Merging Many Files Side by Side
by gone2015 (Deacon) on Feb 20, 2009 at 00:37 UTC
I assume you've tried it and discovered it's too slow ?
200 files at 6.9M lines each, each line three tab-delimited fields -- if that's ~60 bytes per line, you're reading ~80G bytes and writing ~27G (assuming roughly equal field sizes, so you keep about one field in three). On my little Linux box, I just timed cp foo bar for a ~13G file and it took ~15mins -- so 80G in plus 27G out looks like about two hours' work ? Of course, with faster drives and what-not, you may do better.
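(Back of the envelope, with those guessed sizes -- ~60 bytes read per line, ~20 bytes of it kept:)

    perl -e 'printf "read:  %g bytes\n", 200 * 6.9e6 * 60;
             printf "write: %g bytes\n", 200 * 6.9e6 * 20'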
I've never tried opening 200 files at once...
It's hard to say what the processing time will add to this. It looks pretty I/O bound. If processing time is an issue, then I'd look at reading the files a chunk at a time... but I'd have to be convinced there was a problem; and even then I'm not sure that processing chunks of the files in Perl would be quicker than using Perl to read a line at a time.
So... what are you expecting, and what do you get when you try the straightforward approach ?
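For reference, "the straightforward approach" as I picture it is something along these lines -- only a sketch, assuming the file names arrive on the command line, that it's the second tab-delimited field you want from each file, and that the merged lines just go to STDOUT:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # open all the input files up front, keeping the handles in order
    my @fh;
    for my $file (@ARGV) {
        open my $in, '<', $file or die "cannot open $file: $!";
        push @fh, $in;
    }

    # one pass per output line: take the next line from every file
    while (1) {
        my $got_line = 0;
        my @out;
        for my $fh (@fh) {
            my $line = <$fh>;
            if (defined $line) {
                $got_line = 1;
                chomp $line;
                my @f = split /\t/, $line;
                push @out, defined $f[1] ? $f[1] : '';   # cope with a missing/empty column 2
            }
            else {
                push @out, '';                           # this file has run out of lines
            }
        }
        last unless $got_line;                           # all files exhausted
        print join("\t", @out), "\n";
    }

Opening 200 handles at once ought to be fine -- the usual per-process limit on Linux is 1024 -- but ulimit -n will tell you what you've actually got.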
Wishing to squeeze as much as possible out of the inner loop, I think the following may be faster:
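Something along these lines, perhaps -- again just a sketch, reusing the @fh array of open handles from the sketch above; it limits the split to three fields and takes the wanted field straight off the split with a list slice, instead of building a temporary array each time round:

    while (1) {
        my $atleastone;
        my @out;
        for my $op (@fh) {
            my $line = <$op>;
            if (defined $line) {
                $atleastone = 1;
                chomp $line;
                my $field = (split /\t/, $line, 3)[1];    # field 2 of the three
                push @out, defined $field ? $field : '';  # blank/short lines still give ''
            }
            else {
                push @out, '';                            # this file has run out of lines
            }
        }
        last unless $atleastone;
        print join("\t", @out), "\n";
    }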
You seem concerned that column 2 might be empty, or that there may be blank or whitespace-only lines in the input files. If you could guarantee that no line would be empty and none would be whitespace-only, then:
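perhaps something like this (the same sketch, relying on the guarantee: no defined test on the line, and no guard on the field):

    while (1) {
        my $atleastone;
        my @out;
        for my $op (@fh) {
            if (my $line = <$op>) {                   # non-empty lines are always true; undef at EOF
                $atleastone = '';                     # anything defined will do here
                chomp $line;
                push @out, (split /\t/, $line, 3)[1]; # no guard needed on the field
            }
            else {
                push @out, '';
            }
        }
        defined $atleastone or last;
        print join("\t", @out), "\n";
    }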
That squeezes out a tiny fraction ! (The point of $atleastone is just to record that <$op> returned a line somewhere in the pass -- setting it to anything defined, even 0 or "", will do -- so the loop stops only once every file is exhausted.)