Re^2: perl ST sort performance issue for large file? (>*memory*GRT)

by BrowserUk (Patriarch)
on Mar 30, 2012 at 15:11 UTC ( [id://962634] )


in reply to Re: perl ST sort performance issue for large file? (>GRT)
in thread perl ST sort performance issue for large file?

The problem with that -- as coded -- is that it will require 7 concurrent lists, each the same size as the input array.

Assuming an average 128-byte line length, that means it will need to have room for 175 million items on the stack, for a file that contains 25 million records?

The GRT -- coded in the same @out = map() sort map() @in; way -- would only require 5 times as many concurrent items.
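
For reference, a minimal sketch of that single-statement form for the two numeric keys used here (my illustration, not code taken verbatim from the thread):

    my @out = map  { substr $_, 8 }                      # strip the 8-byte packed prefix
              sort                                       # default lexical sort orders the packed keys
              map  { pack 'NNA*', m[(\d+)\D+(\d+)], $_ } # prepend the two 32-bit keys
              @in;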

And if the GRT is coded in three steps:

  1. Prefix added as the array is built:
    my @data; $#data = 25e6; my $i = 0; $data[ $i++ ] = pack "NNA*", m[(\d+)\D+(\d+)], $_ while <>;
  2. sort done in-place:
    @data = sort @data;
  3. Prefix trimmed in-place:
    substr $_, 0, 8, '' for @data;

It requires (for this example of two numeric keys) just 8 bytes/record extra memory (~6%).
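
Assembled as one runnable sketch (my assembly of the snippets above, reading from <>; the pre-extension is trimmed back to the actual record count before sorting):

    #! perl -slw
    use strict;

    # Step 1: pack the two 32-bit keys in front of each record as it is read.
    my @data; $#data = 25e6; my $i = 0;
    $data[ $i++ ] = pack 'NNA*', m[(\d+)\D+(\d+)], $_ while <>;
    $#data = $i - 1;                     # drop any unused pre-extended slots

    # Step 2: sort in place; the packed 8-byte prefix dominates the lexical ordering.
    @data = sort @data;

    # Step 3: trim the 8-byte prefix in place.
    substr( $_, 0, 8, '' ) for @data;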


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

The start of some sanity?

Replies are listed 'Best First'.
Re^3: perl ST sort performance issue for large file? (>*memory*GRT)
by tye (Sage) on Mar 30, 2012 at 15:42 UTC

    More items, smaller items. How that balances out depends (comparing the map+sort+map alternatives).

    I've never seen what you've described proposed under the heading of "GRT" (even your own demonstration of GRT in A brief tutorial on Perl's native sorting facilities. lacks any of these improvements). It looks to be the most memory efficient way I've seen, though it also places several additional restrictions on how it can be used (so is less general-purpose).

    Doing the same optimizations to my technique means 2 large lists instead of 1 large list but the contents of the lists are about the same total size. So, if we have a moderately short key for long records, then your above technique might be 10% overhead while mine (optimized) might be 20% overhead (Perl scalars not being thin on overhead). My technique (optimized) would allow the space for constructed sort keys to be freed when no longer in use (more important for larger keys if the sorted list persists long after the sorting).
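
    Concretely, what I have in mind is something like this sketch (nothing here is verbatim from the thread): the packed keys go into their own pre-extended array, that array is sorted, and the trailing index slices the original records.

        # Two large lists, @data and @keys, instead of one long map/sort/map chain;
        # @keys can be released once the slice has been taken.
        my @keys; $#keys = $#data;
        $keys[ $_ ] = pack 'NNN', $data[ $_ ] =~ m[(\d+)\D+(\d+)], $_
            for 0 .. $#data;
        @keys = sort @keys;
        my @sorted = @data[ map { unpack 'N', substr $_, -4 } @keys ];
        @keys = ();                      # free the constructed sort keys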

    Sorting large records based on a short key, I'm not sure I'd worry about any of your optimizations since even 6 extra small scalars can be a relatively small addition to one huge scalar.

    Thanks for describing this approach. It seems it could make a significant difference in a lot of situations. You should write it up more prominently (or at least update your existing tutorial to cover this).

    - tye        

Re^3: perl ST sort performance issue for large file? (copying)
by tye (Sage) on Mar 30, 2012 at 16:45 UTC

    I was going to mention the advantage of not having to make a copy of each large record but thought maybe you had even avoided that penalty of the classic "decorate the original record" technique. But, looking again, I'm curious about that aspect of:

    $data[ $i++ ] = pack "NNA*", m[(\d+)\D+(\d+)], $_ while <>;

    That still involves an extra copying of each record. For long records, I've certainly seen that add up to a performance problem on some platforms. But that was so very long ago that I also wonder if the cost of copying a large string of bytes has become relatively low compared to the cost of other operations on modern platforms. Actually, no, I have seen that be a problem on a modern platform (in a logging system that kept copying strings repeatedly as more information was added that resulted in logging being the majority cost in the system).

    For fixed-length records, you could avoid the extra copying via:

    for ( $data[ $i++ ] ) {
        read STDIN, $_, $reclen, $keylen;
        substr( $_, 0, $keylen, pack "NN", m[(\d+)\D+(\d+)] );
    }

    (since "\0"x8 won't match \d -- for other cases, you can set pos and add a /g and a scalar context to the regex)
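
    For those other cases, the read-and-decorate loop might look like this (a sketch only; $reclen and $keylen as above):

        for ( $data[ $i++ ] ) {
            read STDIN, $_, $reclen, $keylen;       # record lands after the key slot; the front is "\0"-padded
            pos( $_ ) = $keylen;                    # start matching past the key slot
            my $matched = m[(\d+)\D+(\d+)]g;        # scalar-context //g honours pos()
            substr( $_, 0, $keylen, pack "NN", $1, $2 );
        }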

    I wonder if that would ever make a noticeable difference in performance.

    Given Perl's unfortunate significantly-slower-than-it-should-be implementation of readline (last I checked and except on ancient versions of Unix that are hardly even used these days), you'd probably get a bigger win by unrolling the classic GRT even further and rolling your own readline using large buffers (since sysread with large buffers and split has been repeatedly shown to be about 2x the speed of Perl's implemented-in-C readline, sadly).

    Then you could combine the "copy one record out of the huge buffer" step with the "decorate record with key" step such that together they only copy the record once. But I suspect that would only matter if the cost of reading the records was not insignificant to the cost of sorting the records.
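
    Something along these lines, for instance (a sketch, not benchmarked; 1MB buffer, keys packed as in the examples above):

        # Roll-your-own readline: sysread into a large buffer, split on newlines,
        # and prefix the packed key as each record is carved out of the buffer,
        # so each record body is copied only once.
        my @data;
        my $i    = 0;
        my $tail = '';
        while( sysread( STDIN, my $buf, 1 << 20 ) ) {
            $buf = $tail . $buf;
            my @lines = split m[(?<=\n)], $buf;                  # keep newlines attached
            $tail = $lines[-1] =~ m[\n\z] ? '' : pop @lines;     # hold back any partial last line
            $data[ $i++ ] = pack 'NNA*', m[(\d+)\D+(\d+)], $_ for @lines;
        }
        $data[ $i++ ] = pack 'NNA*', $tail =~ m[(\d+)\D+(\d+)], $tail
            if length $tail;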

    - tye        

      That still involves an extra copying of each record. For long records, I've certainly seen that add up to a performance problem ...

      On my platform at least, copying the records through a single buffer -- thus avoiding the need to request extra memory from the OS for that particular part of the operation -- doesn't seem to incur much of a penalty.

      On a fairly modest 50MB / 1e6-record file:

        C:\test>dir junk.dat
        30/03/2012  17:39        53,777,008 junk.dat

        C:\test>wc -l junk.dat
        1000000 junk.dat

        C:\test>head junk.dat
        Some random junk 333740 119750 some more random junk
        Some random junk 495178 597808 some more random junk
        Some random junk 815582 448120 some more random junk
        Some random junk 604675 737548 some more random junk
        Some random junk 735839 436462 some more random junk
        Some random junk 89324 842712 some more random junk
        Some random junk 18493 967895 some more random junk
        Some random junk 835571 146942 some more random junk
        Some random junk 783477 720062 some more random junk
        Some random junk 197540 926361 some more random junk

      Doing the GRT in 3 stages:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        my $start = time;

        my @a; $#a = 1e6; my $n = 0;
        $a[ $n++ ] = pack 'NNA*', m[(\d+) (\d+)], $_ while <>;
        printf "%d records\n", scalar @a;

        @a = sort @a;

        substr( $_, 0, 8, '' ) for @a;

        printf "Took %.6f seconds; check memory? ", time() - $start;
        <STDIN>;

      Takes 7 seconds and just under double the amount of memory required by the original array:

        C:\test>t-GRT junk.dat
        1000001 records
        Took 6.704649 seconds; check memory? 274.1MB

      Doing it with your original formulation unmodified, but excluding the initial time to load the array from the equation:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        my @a; $#a = 1e6; my $n = 0;
        $a[ $n++ ] = $_ while <>;
        printf "%d records; check memory", scalar @a;
        <STDIN>;

        my $start = time;

        my @sorted = @a[
            map  { unpack 'N', substr( $_, -4 ) }
            sort
            map  { pack 'NNN', $a[ $_ ] =~ m[(\d+) (\d+)], $_ }
            0 .. $#a
        ];

        printf "Took %.6f seconds; check memory? ", time() - $start;
        <STDIN>;

      The sort takes 30 times longer and uses 4.5 times as much memory as the original array:

        C:\test>t-TYE junk.dat
        1000001 records; check memory 143.8MB
        Took 228.533225 seconds; check memory? 649.0 MB

      Most of that time is spent going to the OS over and over to expand the stack piecemeal to accommodate the lists as they expand.


