PerlMonks  

Re^2: Working with fixed length files

by Tux (Canon)
on Apr 28, 2011 at 06:17 UTC [id://901716]


in reply to Re: Working with fixed length files
in thread Working with fixed length files

In theory ikegami's unpack approach should be considerably faster than the substr approach, as unpack is one single OP. This reference approach should fall somewhere in between. I'm curious how a Benchmark would compare the three on the original-sized files, and whether disk I/O actually masks the difference in parsing speed.
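Such a comparison is easy to set up with the Benchmark module. A minimal sketch (the field widths and record content here are illustrative, not the OP's actual layout):

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# One illustrative fixed-width record: fields of 2, 10 and 5 bytes.
my $rec = "03" . ( "x" x 10 ) . ( "y" x 5 );

cmpthese( 100_000, {
    unpack => sub {
        my @f = unpack 'A2 A10 A5', $rec;       # one single OP
    },
    substr => sub {
        my @f = ( substr( $rec, 0, 2 ),
                  substr( $rec, 2, 10 ),
                  substr( $rec, 12, 5 ) );
    },
} );
```

On real data the third contender (the substr-ref buffer) and actual disk I/O would need to be added to see whether the parsing speed difference is visible at all.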


Enjoy, Have FUN! H.Merijn

Replies are listed 'Best First'.
Re^3: Working with fixed length files
by BrowserUk (Patriarch) on Apr 28, 2011 at 07:31 UTC

    1. Ike's code assumes a one-to-one correspondence between the two record types.

      Well founded based on the OP's sample, but these types of mainframe 'carded' records often have multiple secondary records for each primary record.

    2. If the OP confirmed that they were one-to-one, then you could read both record types in a single read and pre-partition accordingly.
    3. The problem with unpack is that the template must be re-parsed for every record.

      And fairly extensive recent additions to the format specification have taken some toll on performance.

      With these short, simply structured records that doesn't exact too much of a penalty, but with longer, more complex records it can.

    4. The idea of pre-partitioning the input buffer with an array of substr refs is that simply assigning each record into the pre-partitioned buffer effectively does the parsing and splitting.

      I think the technique is worth a mention for its own sake.
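    The pre-partitioning idea in point 4 can be sketched minimally (field widths illustrative, not the OP's layout): build an array of refs to substr() lvalues over one buffer; assigning a record into that buffer then "parses" it, because each ref already aliases its field's slice.

```perl
use strict;
use warnings;

# Pre-partition a 17-byte buffer into fields of 2, 10 and 5 bytes
# using refs to substr() lvalues.
my $buf    = ' ' x 17;
my @widths = ( 2, 10, 5 );
my @fields;
my $off = 0;
for my $w ( @widths ) {
    push @fields, \substr( $buf, $off, $w );
    $off += $w;
}

# Assigning a new record into the buffer updates every field at once:
$buf = "03" . ( "x" x 10 ) . ( "y" x 5 );
print join( '/', map $$_, @fields ), "\n";   # 03/xxxxxxxxxx/yyyyy
```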

    A quick run of the two posted programs over the same file shows mine to be a tad quicker, but insignificantly. If I adjust mine to the same assumptions as Ike's, (or Ike's to the same assumptions as mine), then mine comes in ~20% quicker. Only a couple of seconds on 1e6 lines, but could be worth having for 100e6.

    c:\test>901649-buk 901649.dat >nul
    Took 9.283 for 1000000 lines

    c:\test>901649-ike 901649.dat >nul
    Took 11.305 for 1000000 lines

    Code tested:


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      WOW. I'm surprised. Really. I do understand your code and the way it works, but that it outperforms unpack surprises me.

      Combining the two techniques makes me fantasize about bindcolumns for unpack. I'm convinced that the delay for unpack is not the parsing of the format, but the creation and copying of the scalars on the stack and into the target list.

      /me has more wishes for unpack, like unpacking from a stream that automatically moves forward for all bytes/characters read for the unpack.
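      That streaming wish can be roughly emulated today by keeping an explicit offset and skipping consumed bytes with an "x$off" prefix on the template. A hypothetical sketch (all names and the record layout are illustrative; the caller passes the byte length consumed, since deriving it from an arbitrary template is non-trivial):

```perl
use strict;
use warnings;

my $stream = "03" . "AB" . "0042" . "CD" . "0007";
my $off    = 2;                        # skip the record-type tag

# unpack from $stream at the current offset, then auto-advance.
sub stream_unpack {
    my ( $tmpl, $len ) = @_;
    my @fields = unpack "x$off $tmpl", $stream;
    $off += $len;                      # move forward past consumed bytes
    return @fields;
}

my ($tag) = stream_unpack( 'A2', 2 );  # "AB"
my ($num) = stream_unpack( 'A4', 4 );  # "0042"
```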


      Enjoy, Have FUN! H.Merijn

      Unless one of you can prove my benchmark is wrong, I do see exactly what I expected:

      $ perl test.pl
            Rate  buk  ike
      buk 71.1/s   -- -36%
      ike  111/s  56%   --
      $

      The DATA section in the script has trailing \r's:
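      A quick way to neutralize such stray carriage returns before parsing, shown on an illustrative record rather than the script's actual DATA:

```perl
use strict;
use warnings;

# A record saved with CRLF line endings: a plain chomp would leave
# the \r behind and shift every field width by one byte.
my $line = "0312345\r\n";
$line =~ s/\r?\n\z//;            # strip CRLF or bare LF in one go

die "CR left behind" if $line =~ /\r/;
printf "%d bytes\n", length $line;   # 7
```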


      Enjoy, Have FUN! H.Merijn

        You are benchmarking the code from the original nodes, which, as I mentioned, operate on different assumptions.

        Ike's assumption means the while loop only iterates half as many times as it does for mine. The differences you are measuring are down to that.

        If you modify Ike's to read one record at a time and operate upon it conditionally (per my benchmark), or modify mine to read and map the pairs of records into a single pre-partitioned buffer thereby removing the need for the if statement in the loop, then you would be comparing like with like.

        I also tweaked my benchmark code to a) use a fixed-size read, thereby avoiding the newline search; and b) change the condition of the loop so that I could assign the return from readline directly to the mapped buffer, avoiding another copy.

        This was to ensure that the differences being tested were down to unpack versus substr refs, not to the ancillary details of code originally written to demonstrate the technique rather than for performance.

        For more performance, do away with the substr and read directly into the partitioned scalar:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        my $start = time;

        my $rec = chr(0) x 123;

        my @type3l = split ':', '02:10:33:15:19:10:3:18:6:4';
        my $n = 0;
        my @type3o = map{ $n += $_; $n - $_; } @type3l;
        my @type3  = map \substr( $rec, $type3o[ $_ ], $type3l[ $_ ] ),
            0 .. $#type3o;

        my @typeOl = split ':', '02:98:11:9';
        $n = 0;
        my @typeOo = map{ $n += $_; $n - $_; } @typeOl;
        my @typeO  = map \substr( $rec, $typeOo[ $_ ], $typeOl[ $_ ] ),
            0 .. $#typeOo;

        until( eof() ) {
            read( ARGV, $rec, 123, 0 );
            if( $rec =~ /^03/ ) {
                print join '/', map $$_, @type3;
            }
            else {
                print join '|', map $$_, @typeO;
            }
        }

        printf STDERR "Took %.3f for $. lines\n", time() - $start;

        And for ultimate performance, switch to binmode & sysread to avoid the overhead of the Windows CRLF layer. But that requires other tweaks as well, and I'm 21 hours into this day already.
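        The binmode & sysread variant might look something like this minimal sketch (record length, file name and callback are placeholders, not the node's actual code):

```perl
use strict;
use warnings;

# Read fixed-size records with the :raw layer + sysread, bypassing
# CRLF translation. Hands each raw record to a callback and returns
# the number of complete records read.
sub read_fixed {
    my ( $file, $reclen, $cb ) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!";
    my ( $rec, $n ) = ( '', 0 );
    while ( sysread( $fh, $rec, $reclen ) == $reclen ) {
        $cb->( $rec );          # parse via unpack or substr refs here
        $n++;
    }
    close $fh;
    return $n;
}
```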

        But whatever, you do need to be comparing like with like.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        perl version?

        With your program I get 'x' outside of string in unpack because of the x2; after removing those, I get

              Rate  buk  ike
        buk 19.6/s   -- -43%
        ike 34.1/s  74%   --

        $ perl -e "die $^V"
        v5.12.2
        On 5.008009 I get
              Rate  buk  ike
        buk 22.7/s   -- -35%
        ike 35.1/s  54%   --
              Rate  buk  ike
        buk 24.9/s   -- -47%
        ike 47.1/s  89%   --

        $ ..\perl.exe -e "die $^V"
        v5.14.0
        This is a typical Win32 MinGW/ActiveState build.

        Update: well, you didn't copy buk's code exactly; you omitted

        local $/ = \(2 * 122);

        which appears critical
        5.008009
              Rate  ike  buk
        ike 35.5/s   -- -57%
        buk 83.1/s 134%   --

        v5.12.2
              Rate  ike  buk
        ike 33.6/s   -- -55%
        buk 74.4/s 121%   --

        v5.14.0
              Rate  ike  buk
        ike 46.3/s   -- -48%
        buk 88.2/s  91%   --
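        For reference, setting $/ to a reference to an integer is what switches readline into fixed-record mode, which is why the omitted line is critical. A small self-contained demonstration (record sizes illustrative):

```perl
use strict;
use warnings;

# 24 bytes of data: six 4-byte numbers.
my $data = join '', map { sprintf "%04d", $_ } 1 .. 6;
open my $fh, '<', \$data or die $!;

local $/ = \8;                 # readline now returns 8-byte records
my @chunks;
push @chunks, $_ while <$fh>;
# @chunks is now ( "00010002", "00030004", "00050006" )
```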

      Your third point made me curious. Running the below benchmark doesn't show a serious slowdown for the unpack code:

      Running perl-all test.pl
      === base/perl5.8.9 5.008009 i686-linux-64int
            Rate  buk  ike
      buk 65.4/s   -- -41%
      ike  110/s  68%   --
      === base/tperl5.8.9 5.008009 i686-linux-thread-multi-64int-ld
            Rate  buk  ike
      buk 60.8/s   -- -37%
      ike 95.9/s  58%   --
      === base/perl5.10.1 5.010001 i686-linux-64int
            Rate  buk  ike
      buk 61.9/s   -- -39%
      ike  102/s  65%   --
      === base/tperl5.10.1 5.010001 i686-linux-thread-multi-64int-ld
            Rate  buk  ike
      buk 55.4/s   -- -37%
      ike 88.4/s  60%   --
      === base/perl5.12.2 5.012002 i686-linux-64int
            Rate  buk  ike
      buk 63.0/s   -- -41%
      ike  107/s  70%   --
      === base/tperl5.12.2 5.012002 i686-linux-thread-multi-64int-ld
            Rate  buk  ike
      buk 54.3/s   -- -39%
      ike 88.4/s  63%   --
      === base/perl5.14.0 5.014000 i686-linux-64int
            Rate  buk  ike
      buk 59.9/s   -- -49%
      ike  117/s  96%   --
      === base/tperl5.14.0 5.014000 i686-linux-thread-multi-64int-ld
            Rate  buk  ike
      buk 52.8/s   -- -49%
      ike  104/s  97%   --

      Enjoy, Have FUN! H.Merijn
