http://qs321.pair.com?node_id=971949

live4tech has asked for the wisdom of the Perl Monks concerning the following question:

I have a few very large text files (~50MB), millions of rows, 5 columns (numbers separated by spaces). The first row is a 2-column 'header'. Each row ends in a \r\n, and every other row is a \r\n on its own (i.e., a blank line). My task was to do something quick and dirty to cut these large files into smaller files; the resulting smaller files would have 300,000 rows per file. I have been learning Perl to deal with just such tasks (I am still working through the Camel book in my 'spare time'). So I tried the following code:

my $pre = $ARGV[0];
my $linenum = 0;
my $filenum = 0;

open FILEOUT, '>', $pre."-".$filenum;

while (<>) {
    if ($linenum <= 300000) {
        if (/^\r\n$/)   # skip the linefeed carriage return lines,
                        # do not increment line counter or print line to file
        {
        }
        else {
            print FILEOUT $_;
            $linenum++;
        }
    }
    elsif ($linenum > 300000) {
        if (/^\r\n$/)   # skip the linefeed carriage return lines,
                        # do not increment line counter or print line to file
        {
        }
        else {
            $linenum = 0;   # reset line counter every 300,000 lines
            $filenum++;     # increment file counter every 300,000 lines
                            # and open new file handle
            open FILEOUT, '>', $pre."-".$filenum;
            print FILEOUT $_;
        }
    }
}
close FILEOUT;

This worked great; I just called the script with each filename on the command line, one at a time. Except that the new files had 299,701 or 299,702 rows instead of 300,000. I cannot understand how this would happen with the above code! It's really been sand in my shorts, but I bet it is something simple, something a good monk could pick up on. THANKS!

Replies are listed 'Best First'.
Re: Help with problem
by moritz (Cardinal) on May 23, 2012 at 07:37 UTC
    I've tried out your script, just changing the 300000 to 50 for easier debugging. The first output file has 51 lines (because you only stop writing when the line count is greater than your desired limit); all following files have 52 lines (because, in addition, you don't count the first line written after the counter is reset).

    So I guess that the script is mostly working the way you want, but something's wrong with your testing of the script.
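    To see the counting behaviour in isolation, here is a minimal sketch (my own reduction, not live4tech's code or data) that pushes 20 fake data lines through the same branching logic with a limit of 5 and reports how many lines land in each chunk:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $limit   = 5;    # stand-in for 300000
    my $linenum = 0;
    my $filenum = 0;
    my %count;          # chunk number => lines "written"

    for (1 .. 20) {                 # 20 fake data lines, no blank lines
        if ($linenum <= $limit) {
            $count{$filenum}++;     # same branch as "print FILEOUT $_"
            $linenum++;
        }
        else {
            $linenum = 0;           # same reset as the original script
            $filenum++;
            $count{$filenum}++;     # this write is never counted
        }
    }

    print "chunk $_: $count{$_} lines\n" for sort { $a <=> $b } keys %count;
    # chunk 0: 6 lines, chunk 1: 7 lines, chunk 2: 7 lines --
    # i.e. limit+1 and limit+2, matching the 51/52 observation above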

Re: Help with problem
by Athanasius (Archbishop) on May 23, 2012 at 07:29 UTC

    Update: Added use autodie; to the code.

    Hi live4tech,

    Three points:

    • You should Choose a Good, Descriptive Title for your posts.
    • It’s not a good idea to try to match on \r\n, as this brings in too many complications (as well as being non-portable). Much better to strip these first, then add them back only when needed (i.e., when printing). See the code below.
    • There is an off-by-one error in your logic in the final else clause: $linenum is set to 0, but it should be 1, as a line is immediately written to the file.

    That said, I’m still not clear on how you could be getting files with, e.g., 299,701 rows. The suggestion of Anonymous Monk that it’s because you skip the empty lines doesn’t persuade, as there are (according to your specification) as many blank lines as there are data entry lines; and your logic ignores blank lines anyway.

    I offer the following in the hope that it may do what you need:

    #!perl
    use strict;
    use warnings;
    use autodie;

    my $pre       = $ARGV[0];
    my $max_lines = 300_000;
    my $linenum   = 0;
    my $filenum   = 0;

    open my $fileout, '>', $pre . '-' . $filenum;

    while (my $line = <>)
    {
        $line =~ s/ \s* $ //x;      # remove trailing whitespace (incl. "\r\n")

        if ($line ne '')            # ignore blank lines
        {
            if ($linenum++ < $max_lines)
            {
                print $fileout $line, "\n";
            }
            else
            {
                close $fileout;
                open $fileout, '>', $pre . '-' . ++$filenum;
                print $fileout $line, "\n";
                $linenum = 1;
            }
        }
    }

    close $fileout;
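    For what it's worth, the calling convention is unchanged from the original: assuming the script is saved as, say, split_rows.pl (a name chosen purely for illustration), it would be run once per input file, e.g.

    perl split_rows.pl bigdata.txt

    which would produce bigdata.txt-0, bigdata.txt-1, and so on, each holding at most 300,000 data rows.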

    HTH,

    Athanasius <°(((>< contra mundum

      The suggestion of Anonymous Monk that it’s because you skip the empty lines doesn’t persuade

      And what is your suggestion? Why do you think it happens?

      The code is fairly short and simple, and we only have live4tech's word that there are missing records

      Your reworking of live4tech's code, aside from moving the 300,000th line into the new file, doesn't change anything else -- if live4tech's original code had records go missing, so would your reworked code (they're virtually identical).

        They're virtually identical

        Except that my pattern matching is (slightly) different.

        As I said, I don’t know why the original code wasn’t working (except for the off-by-one error). At best, the use of \r\n in the pattern match may be a red herring, in which case it will be useful to “eliminate it from our inquiries” (I read too many whodunnits). At worst, it may be introducing some bug which live4tech will find is fixed in my version.

        It will be interesting to find out.

        Athanasius <°(((>< contra mundum

Re: Help with problem
by aaron_baugher (Curate) on May 23, 2012 at 13:14 UTC

    A wild guess: Since you reopen FILEOUT to a new file without closing it first, maybe there's an issue with some buffered data ending up in the wrong file? I think perl is usually smart about closing file descriptors in cases like that, but I don't know if you can always count on it. Perhaps you should close it before reopening it.

    Aside from that, I'd have to agree with moritz: some issue with testing. Maybe whatever you're using to count the lines in your resulting files doesn't have exactly the same definition for "line" that perl does. Incidentally, there is a perfectly good *nix utility for this kind of thing:

    grep -v ^$ inputfile | split -d -l 300000 - outputfile
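    On the counting point, one way to remove any doubt about whose definition of "line" is in play is to let perl do the counting itself; for example (outputfile-0 standing in for whichever chunk you want to check):

    perl -ne 'END { print "$.\n" }' outputfile-0

    $. holds the number of the last line read, so this prints the line count exactly as perl's readline sees it.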

    Aaron B.
    Available for small or large Perl jobs; see my home node.

      A wild guess: Since you reopen FILEOUT to a new file without closing it first, maybe there's an issue with some buffered data ending up in the wrong file? I think perl is usually smart about closing file descriptors in cases like that, but I don't know if you can always count on it. Perhaps you should close it before reopening it.

      Reopening a file handle to a different file is just fine. I/O buffers do get flushed to disk and the file is closed in the normal way. You can do an explicit close(), but it is not necessary.
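      A minimal sketch of what is meant, using throwaway file names:

      use strict;
      use warnings;

      open my $out, '>', 'part-0' or die "part-0: $!";
      print $out "first chunk\n";

      # reopening the same handle on a new file implicitly flushes and
      # closes 'part-0' before 'part-1' is created
      open $out, '>', 'part-1' or die "part-1: $!";
      print $out "second chunk\n";

      close $out;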

      Aaron, that grep line looks promising (short and 'simple' - I like that!). I do not really understand it, but I want to, so I will review grep in perldocs and elsewhere and hopefully be able to decipher the line so I will be able to adapt it to my needs in the future. Thanks so much!

      To everyone else who has commented, thank you too! The logic in the if statements in the original code is correct.

      I am going to try the simpler and prettier code written by Athanasius. BTW - I know the row count is correct because I looked at it in a few ways and checked a number of lines at the beginning, middle and end of several cut files against the original and these were right on. I will update the Monastery after I try the new code.

      One last note - I mentioned I was working through the "Camel" book; well, actually it's the "Llama" book... sorry, Perlers.

        Thanks! Here's that command line explained bit by bit:

        grep            grep for
        -v              lines that DO NOT match
        ^$              an empty line (begin and end with nothing between)
        inputfile       in the file "inputfile"
        |               pipe the results to
        split           the split program, which divides up a file
        -d              naming the output files with digits
        -l 300000       and putting 300000 lines in each
        -               getting the input from stdin (the pipe)
        outputfile      and naming the output files starting with outputfile (followed by digits)

        Aaron B.
        Available for small or large Perl jobs; see my home node.

Re: Help with problem
by Anonymous Monk on May 23, 2012 at 05:44 UTC

    I cannot understand how this would happen with the above code!

    It's probably because you skip the empty lines.

    Count the empty lines, add that number to the new line counts, and compare to the original line counts.
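    For example (inputfile being a placeholder for one of the original files), something like this should give the blank-line count:

    perl -ne '$blank++ if /^\r?\n$/; END { print $blank, "\n" }' inputfile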