http://qs321.pair.com?node_id=954120

shoness has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I've run some jobs that print out statistics about memory usage during their life. I use "grep" to collect relevant data into a single file. I parse that file with Perl to pull out the data that I want and create a comma-separated-values file that I can operate on with a spreadsheet tool.

Each line of the output file contains one of the data elements I want to collect. There is lots of other data that I don't want. The input data looks like this:

...
Pass #123
...
...
Elapsed Time        : 1753.2 sec
CPU Time            : 753.2 sec
...
Virtual memory size : 4472.6 MB
Resident set size   : 4362 MB
...
Major page faults   : 7153
...
Pass #124
...
...

From this data, I expect to create a line like this:

123, 1753.2, 753.2, 4472.6, 4362, 7153

As you can see, it's sort of taking the original source data and turning it sideways, stripping off the descriptive and unwanted text. Each value's position in the output line tells me which statistic it is.

My working solution is below. I just think that since "tmtowtdi", that "tMBABwtdi". I'd love to see your thoughts on what surely must be a very common task.

Thanks!

use strict;
use warnings;

sub slurp {
    local $/ = undef;
    local *file;
    open file, $_[0] or die "Can't open $_[0]: $!";
    my $slurp = <file>;
    close file or die "Can't close $_[0]: $!";
    $slurp;
}

my $indata = slurp('noa.txt');

print "pass, wall time (sec), CPU time (sec), VM (MB), ResMem (MB), Page Faults\n";
while ($indata =~ m/^Pass\s\#(\d+).*?
                    ^Elapsed\ Time\s+:\s+([\d\.]+).*?
                    ^CPU\ Time\s+:\s+([\d\.]+).*?
                    ^Virtual\ memory\ size\s+:\s+([\d\.]+).*?
                    ^Resident\ set\ size\s+:\s+([\d\.]+).*?
                    ^Major\ page\ faults\s+:\s+([\d\.]+)
                   /msgcx) {
    print "$1, $2, $3, $4, $5, $6\n";
}

Replies are listed 'Best First'.
Re: Multiline RegExp. A Better Way?
by NetWallah (Canon) on Feb 16, 2012 at 04:54 UTC
    This one is light on memory, and does not need modules.

    Basically, it uses the "Pass #" as a record delimiter.

    use strict;
    use warnings;

    my $file = "data1.txt";
    local $/ = "\nPass #";
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (<$fh>) {
        print join(",",
            m/(\d+).*?
              ^Elapsed\ Time\s+:\s+([\d\.]+).*?
              ^CPU\ Time\s+:\s+([\d\.]+).*?
              ^Virtual\ memory\ size\s+:\s+([\d\.]+).*?
              ^Resident\ set\ size\s+:\s+([\d\.]+).*?
              ^Major\ page\ faults\s+:\s+([\d\.]+)
             /msgcx), "\n";
    }
    close $fh;

                “PHP is a minor evil perpetrated and created by incompetent amateurs, whereas Perl is a great and insidious evil perpetrated by skilled but perverted professionals.”
            ― Jon Ribbens

Re: Multiline RegExp. A Better Way?
by kennethk (Abbot) on Feb 16, 2012 at 02:46 UTC
    First off, the best code is code that works. As your code functions, everything else is necessarily subjective.

    That said, I would modify your slurp function to look like:

    sub slurp {
        my $file = shift;
        local $/;
        open my $fh, '<', $file or die "Can't open $file: $!";
        return <$fh>;
    }

    The big changes here are swapping to indirect filehandles and 3-argument open. Indirect filehandles are guaranteed not to collide with a previously existing filehandle and will automatically close once the variable goes out of scope. See Indirect Filehandles in perlopentut for more virtues. 3-argument open doesn't really matter in this case, but it protects you from some malicious vectors, so it's usually considered a good habit to get into. I removed the = undef from your $/ localization, since that is redundant. I explicitly named the input parameter and explicitly returned to make the intent more obvious to the casual reader.
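    To illustrate the point about local $/: it is undefined only inside slurp's scope and restored on return, so line-by-line reads elsewhere in the program are unaffected. A minimal sketch (the temp file and its contents are made up for demonstration):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Same shape as the slurp above: lexical filehandle, 3-argument open,
# and $/ localized so the whole file comes back in one read.
sub slurp {
    my $file = shift;
    local $/;    # undef for this scope only; restored when slurp returns
    open my $fh, '<', $file or die "Can't open $file: $!";
    return <$fh>;    # $fh closes automatically when it goes out of scope
}

# Write a throwaway two-line file to slurp back.
my ($fh, $tmp) = tempfile(UNLINK => 1);
print {$fh} "alpha\nbeta\n";
close $fh or die "Can't close $tmp: $!";

my $all = slurp($tmp);    # both lines in a single string
print length($all), "\n";
die "slurp leaked \$/" unless $/ eq "\n";    # $/ is back to normal here
```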

    Since you are outputting CSV, I would also likely use an explicit CSV module, such as Text::CSV. Again, it doesn't matter in this case, but it handles escaping which may matter to you in the future.
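    A minimal sketch of the escaping point with Text::CSV (the field values here are invented; the module quotes only the fields that need it):

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

# A field containing a comma (or a quote) would corrupt a hand-rolled
# join(",", ...) line; combine() quotes it for us.
my @row = (123, 1753.2, 'note, with comma');
$csv->combine(@row) or die "combine failed";
print $csv->string, "\n";    # 123,1753.2,"note, with comma"
```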

    Finally, rather than using a while with a long multiline regex, I'd probably either do a streaming parser or a split. The longer the regex, the easier it is to break and the harder it is to fix. But, as far as they go, yours is pretty clean.
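    As an illustration of the split variant (a sketch only, with the sample data inlined and each statistic matched by its own small regex rather than one long one):

```perl
use strict;
use warnings;

# Sample data inlined for the demonstration.
my $indata = <<'END';
Pass #123
Elapsed Time : 1753.2 sec
CPU Time : 753.2 sec
Virtual memory size : 4472.6 MB
Resident set size : 4362 MB
Major page faults : 7153
END

# Split into one record per pass, then pull each field out separately,
# so a single missing field can't silently derail the whole match.
my @out;
for my $record (split /^Pass\s#/m, $indata) {
    next unless $record =~ /\A(\d+)/;    # skip any text before the first pass
    my @fields = ($1);
    for my $label ('Elapsed Time', 'CPU Time', 'Virtual memory size',
                   'Resident set size', 'Major page faults') {
        push @fields, $record =~ /^\Q$label\E\s+:\s+([\d.]+)/m ? $1 : '';
    }
    push @out, join(', ', @fields);
}
print "$_\n" for @out;
```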

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
Re: Multiline RegExp. A Better Way?
by kcott (Archbishop) on Feb 16, 2012 at 03:09 UTC

    Here's another way. It may be a better way. I've used Tie::File so that, even with very large input files, you won't run into any memory issues.

    use strict;
    use warnings;

    use Tie::File;

    my $infile = q{noa.txt};

    my $wanted_re = qr{
        \A
        (?>
            (
                Pass \s \#
                |
                Elapsed \s Time \s+ : \s+
                |
                CPU \s Time \s+ : \s+
                |
                Virtual \s memory \s size \s+ : \s+
                |
                Resident \s set \s size \s+ : \s+
                |
                Major \s page \s faults \s+ : \s+
            )
            ( [\d.]+ )
        )
    }msx;

    my $last_stat_re = qr{ \A Major \s page \s faults \s+ : \s+ \z }msx;

    tie my @indata, q{Tie::File}, $infile or die $!;

    for my $line (@indata) {
        next if $line !~ $wanted_re;
        print $2;
        print $1 =~ $last_stat_re ? qq{\n} : q{, };
    }

    untie @indata;

    I dummied up some additional input to what you provided:

    $ cat noa.txt
    ...
    Pass #123
    ...
    ...
    Elapsed Time        : 1753.2 sec
    CPU Time            : 753.2 sec
    ...
    Virtual memory size : 4472.6 MB
    Resident set size   : 4362 MB
    ...
    Major page faults   : 7153
    ...
    Pass #the salt
    ...
    Pass #124
    ...
    ...
    Elapsed Time        : 9753.2 sec
    CPU Time            : 953.2 sec
    ...
    Virtual memory size : 9472.6 MB
    Resident Evil
    Resident set size   : 9362 MB
    ...
    Major page faults   : 9153
    ...

    Here's the output:

    $ noa.pl
    123, 1753.2, 753.2, 4472.6, 4362, 7153
    124, 9753.2, 953.2, 9472.6, 9362, 9153

    -- Ken

Re: Multiline RegExp. A Better Way?
by sundialsvc4 (Abbot) on Feb 16, 2012 at 14:04 UTC

    While it might be overkill in this case, as an aside I would mention Parse::RecDescent, which is an excellent tool for “complicated” inputs that exhibit a predictable structure. It has a bit of a learning curve associated with it, but it really “moves the freight” in the right situations. I’ve asked it to do things that were deserving of overtime and maybe a disability claim, and it just did ’em.