Re^2: sorting very large text files

by salva (Canon)
on Dec 19, 2009 at 09:37 UTC


in reply to Re: sorting very large text files
in thread sorting very large text files

You simply aren't going to beat the highly optimised C code using Perl.
Out of curiosity, I have been looking at the supposedly "highly optimised C code" of GNU sort, and the reality is that it is not so highly optimised: it just uses a very naïve merge sort. In comparison, Perl's internal sortsv function is much more advanced...
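
For anyone who wants to check, here is a rough benchmark sketch (the record count, key format, and file names are made up for illustration, and the external run also pays for file I/O, so take the numbers as indicative only):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # Keep GNU sort doing plain byte comparisons, like Perl does;
    # locale-aware collation can slow it down enormously.
    $ENV{LC_ALL} = 'C';

    my $n = 1_000_000;
    my @lines = map { sprintf "%013d\n", int rand 1e13 } 1 .. $n;

    my $t0 = time;
    my @sorted = sort @lines;
    printf "perl sort: %.2fs\n", time - $t0;

    open my $fh, '>', 'bench.tmp' or die $!;
    print {$fh} @lines;
    close $fh;

    $t0 = time;
    system('sort', '-o', 'bench.sorted.tmp', 'bench.tmp') == 0
        or die "external sort failed";
    printf "GNU sort:  %.2fs\n", time - $t0;
    unlink 'bench.tmp', 'bench.sorted.tmp';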

Re^3: sorting very large text files
by BrowserUk (Patriarch) on Dec 19, 2009 at 11:00 UTC
    Perl's internal sortsv function is much more advanced...

    Possibly so, but you're still comparing C code to C code. I said Perl -v- C.

    The main problem with Perl's sort is that you have to build an array (or a list, but that's worse) first. For a 40 million record file that is going to require 3.6GB(*), and that's beyond any 32-bit OS to handle. And on the basis of the OP's example records, that is only a 4GB file.

    So for the 15GB, ~165 million record file you're talking about a memory requirement--just to hold the data--of ~16GB(*) (assuming a 13-byte key and a 64-bit offset), which is beyond the reach of most current 64-bit commodity hardware.

    (*)Minimum; much more if you store the entire record (even compressed, when you'd also have to factor in the compression and decompression steps) rather than just keys and offsets. And if you go for the RAM-minimising key+offset method, as sketched below, then you have to re-read the entire input file to construct the output file. Another overhead.
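
    For clarity, the key+offset method looks something like this minimal sketch (it assumes the first 13 bytes of each line are the key, a 64-bit Perl for the Q pack template, and made-up file names):

        use strict;
        use warnings;

        open my $in, '<', 'big.dat' or die $!;

        # Build one packed "key + offset" string per record.
        my @index;
        my $off = tell $in;
        while (my $line = <$in>) {
            push @index, pack 'a13 Q', substr($line, 0, 13), $off;
            $off = tell $in;
        }

        # A lexicographic sort orders by the 13-byte key first; the
        # trailing native-endian offset bytes only break ties arbitrarily.
        @index = sort @index;

        # Re-read the input to construct the output: the extra pass.
        open my $out, '>', 'big.sorted' or die $!;
        for my $rec (@index) {
            my $offset = unpack 'x13 Q', $rec;
            seek $in, $offset, 0 or die $!;
            print {$out} scalar <$in>;
        }

    Even then, each packed string carries Perl's per-scalar overhead on top of the 21 raw bytes, which is where estimates well beyond the raw data size come from.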

    it just uses a very naïve merge sort

    It would be interesting to know what you mean by naive in this context.

    But I think you may be missing the point: the main advantage of most system sort utilities is that they know how to use temporary spill files to avoid running out of memory. Something that is impossible using Perl's internal sort.
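
    To illustrate the spill-file strategy, here is a minimal sketch (the chunk size is arbitrary, and the linear-scan merge should really be a heap once the number of runs grows):

        use strict;
        use warnings;
        use File::Temp qw(tempfile);

        sub external_sort {
            my ($in_path, $out_path, $chunk_lines) = @_;
            open my $in, '<', $in_path or die $!;

            # Phase 1: sort fixed-size chunks in memory, spilling each
            # sorted run to its own temporary file.
            my @runs;
            until (eof $in) {
                my @chunk;
                while (@chunk < $chunk_lines) {
                    my $line = <$in>;
                    last unless defined $line;
                    push @chunk, $line;
                }
                my ($fh, $name) = tempfile(UNLINK => 1);
                print {$fh} sort @chunk;
                close $fh;
                push @runs, $name;
            }

            # Phase 2: k-way merge of the runs, always emitting the
            # smallest pending line.
            my @fhs   = map { open my $fh, '<', $_ or die $!; $fh } @runs;
            my @heads = map { scalar <$_> } @fhs;
            open my $out, '>', $out_path or die $!;
            while (grep defined, @heads) {
                my $min;
                for my $i (0 .. $#heads) {
                    next unless defined $heads[$i];
                    $min = $i if !defined $min or $heads[$i] lt $heads[$min];
                }
                print {$out} $heads[$min];
                $heads[$min] = readline $fhs[$min];
            }
        }

        external_sort('big.dat', 'big.sorted', 500_000);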


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      It would be interesting to know what you mean by naive in this context.

      I mean the algorithm as you would learn it from the usual "introduction to algorithms and data structures" course, without taking into consideration practical issues such as hierarchical memory with several levels of cache, or the fact that real data is frequently not completely random.
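
      To make that concrete, this is roughly the textbook algorithm I mean; correct and O(n log n), but oblivious to caches and to any presorted runs already in the data:

          use strict;
          use warnings;

          # Top-down merge sort, straight from the course notes.
          sub merge_sort {
              my @a = @_;
              return @a if @a <= 1;
              my $mid   = int(@a / 2);
              my @left  = merge_sort(@a[0 .. $mid - 1]);
              my @right = merge_sort(@a[$mid .. $#a]);
              my @merged;
              while (@left and @right) {
                  # Taking from the left on ties keeps the sort stable.
                  push @merged, $left[0] le $right[0] ? shift @left : shift @right;
              }
              return (@merged, @left, @right);
          }

      As far as I can tell, Perl's sortsv is, by contrast, an adaptive merge sort that detects and exploits runs that are already ordered, which is exactly the kind of refinement this sketch lacks.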

      I think you may be missing the point

      Well, actually I didn't want to make any point beyond expressing my surprise at the suboptimal algorithm used in that particular implementation of the sort command.

      I had always taken for granted that nothing could be faster than the external command (at least for big data sets, where the required set-up overhead gets diluted), but now I would also check other approaches, for instance the excellent Sort::External.
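
      For reference, its feed/finish/fetch interface looks like this (the 16MB memory threshold and the file name are arbitrary choices for illustration):

          use strict;
          use warnings;
          use Sort::External;

          my $sortex = Sort::External->new( mem_threshold => 1024 ** 2 * 16 );

          open my $fh, '<', 'big.dat' or die $!;
          $sortex->feed($_) while <$fh>;
          close $fh;

          $sortex->finish;    # sorts and merges the temporary runs
          while ( defined( my $line = $sortex->fetch ) ) {
              print $line;
          }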
