PerlMonks  


I assume you've tried it and discovered it's too slow?

200 files at 6.9M lines each, each line three tab-delimited fields: if that's 60 bytes per line, you're reading ~80 GB and writing ~27 GB (assuming roughly equal field sizes). On my little Linux box, I just timed cp foo bar for a ~13 GB file and it took ~15 minutes, so 80 GB + 27 GB looks like about two hours' work? Of course, with faster drives and what-not, you may do better.
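
The back-of-envelope estimate above can be written out as a small script. This is only a sketch of the arithmetic: the 60 bytes per line and the cp timing are the guesses from the post, and scaling the cp time by the total bytes read plus written is a crude assumption, not a measurement.

```perl
use strict;
use warnings;

# Figures from the post; 60 bytes/line is a guess, not a measurement.
my $files          = 200;
my $lines_per_file = 6.9e6;
my $bytes_per_line = 60;
my $fields         = 3;

# Every byte of every input file has to be read.
my $read_bytes = $files * $lines_per_file * $bytes_per_line;

# Only one field of three is kept, assuming roughly equal field sizes.
my $write_bytes = $read_bytes / $fields;

# Observed on one machine: cp of a ~13 GB file took ~15 minutes.
my $minutes_per_gb = 15 / 13;

my $total_gb = ($read_bytes + $write_bytes) / 1e9;
my $minutes  = $total_gb * $minutes_per_gb;

printf "read  %.1f GB\n", $read_bytes / 1e9;
printf "write %.1f GB\n", $write_bytes / 1e9;
printf "roughly %.0f minutes\n", $minutes;
```

Swap in your own line length and a timing from your own disks; the point is only that the job is dominated by I/O volume.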

I've never tried opening 200 files at once...

It's hard to say what the processing time will add to this. It looks pretty I/O bound. If processing time is an issue, then I'd look at reading the files a chunk at a time... but I'd have to be convinced there was a problem; and even then I'm not sure that processing chunks of the files in Perl would be quicker than using Perl to read a line at a time.

So... what are you expecting, and what do you get when you try the straightforward approach?

Wishing to squeeze as much as possible out of the inner loop, I think the following may be faster:

    my $atleastone = 1;
    while ($atleastone) {
        $atleastone = 0;
        my @l = ();
        foreach my $op (@handles) {
            my $c2;
            ++$atleastone and (undef, $c2) = split if defined($_ = <$op>);
            push @l, $c2 || "0";
        }
        print OUTFILE join("\t", @l), "\n";
    }
or possibly:
    my $atleastone = 1;
    while ($atleastone) {
        $atleastone = 0;
        my $l = '';
        foreach my $op (@handles) {
            my $c2;
            ++$atleastone and (undef, $c2) = split if defined($_ = <$op>);
            $l .= ($c2 || "0") . "\t";
        }
        chop $l;    # Discard trailing "\t"
        print OUTFILE "$l\n";
    }
You seem concerned that column 2 might be empty, or that there may be blank or whitespace only lines in the input files. If you could guarantee that no line would be empty and no line would be whitespace only, then:
    my $atleastone = 1;
    while (defined($atleastone)) {
        $atleastone = undef;
        my $l = '';
        foreach my $op (@handles) {
            my $c2;
            ($atleastone, $c2) = split if defined($_ = <$op>);
            $l .= ($c2 || "0") . "\t";
        }
        chop $l;    # Discard trailing "\t"
        print OUTFILE "$l\n";
    }
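squeezes out a tiny fraction more! (The split sets $atleastone to something, even 0 or "", whenever <$op> returns a line, so the loop keeps going; an empty or whitespace-only line would leave it undef and stop the loop early.)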
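
For anyone wanting to try this end to end, here is a self-contained sketch of the first variant, with the file opening included. The name merge_column2 and the command-line usage are my own assumptions, not anything from the original thread. It also differs from the loops above in one respect: as far as I can tell, those print one last row of "0"s after every file has hit EOF, and this version stops before printing it.

```perl
use strict;
use warnings;

# Hypothetical helper: write column 2 of each input handle side by side,
# tab separated, using "0" where an input is exhausted or has no column 2.
sub merge_column2 {
    my ($handles, $out) = @_;
    while (1) {
        my $got_line = 0;
        my @row;
        for my $fh (@$handles) {
            my $c2;
            if (defined(my $line = <$fh>)) {
                $got_line = 1;
                (undef, $c2) = split ' ', $line;   # whitespace split, as in the post
            }
            push @row, $c2 || "0";
        }
        last unless $got_line;    # every input exhausted: skip the all-"0" row
        print {$out} join("\t", @row), "\n";
    }
    return;
}

# Usage (assumed): perl merge.pl file1 file2 ... > merged.txt
if (@ARGV) {
    my @handles = map {
        open my $fh, '<', $_ or die "Cannot open $_: $!";
        $fh;
    } @ARGV;
    merge_column2(\@handles, \*STDOUT);
}
```

Whether 200 simultaneously open handles is a problem depends on your per-process file descriptor limit, so check that before blaming Perl.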


In reply to Re: Merging Many Files Side by Side by gone2015
in thread Merging Many Files Side by Side by sesemin
