PerlMonks  

perl internal sortsv function is much more advanced...

Possibly so, but still you're comparing C-code to C-code. I said Perl -v- C.

The main problem with Perl's sort is that you have to build an array (or list, but that's worse) first. For a 40 million record file, that is going to require 3.6GB(*), which is more than any 32-bit OS can handle. And on the basis of the OP's example records, that is only a 4GB file.

So for the 15GB, ~165 million record file, you're talking about a memory requirement--just to hold the data--of ~16GB(*) (assuming a 13-byte key and a 64-bit offset), which is beyond the realms of most current 64-bit commodity hardware.
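
To put those figures side by side, here is a rough back-of-envelope sketch. It only backs out the per-record cost that the two estimates imply; it does not show how they were derived. Treating GB as 10**9 bytes is an assumption, and the sub and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Per-record cost implied by a total size and a record count.
sub bytes_per_record {
    my ($total_bytes, $records) = @_;
    return $total_bytes / $records;
}

# Figures quoted above; GB is taken as 10**9 bytes (an assumption).
my @estimates = (
    [ '40 million records',  3.6e9, 40e6  ],
    [ '165 million records', 16e9,  165e6 ],
);

# Raw payload per record for the larger file:
# 13-byte key plus 8-byte (64-bit) offset.
my $payload = 13 + 8;

for my $e (@estimates) {
    my ($label, $total, $records) = @$e;
    printf "%-20s implied %.0f bytes/record, raw payload %d bytes/record\n",
        $label, bytes_per_record($total, $records), $payload;
}
```

On those assumptions the estimates work out to roughly 90 to 97 bytes per record, against 21 bytes of raw key and offset. The difference is what it costs to hold each record as a Perl value rather than as packed bytes.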

(*) Minimum; much more if you store the entire record (even compressed--when you'd also have to factor in the compression and decompression steps), rather than just keys and offsets. And if you go for the RAM-minimising key+offset method, then you have to re-read the entire input file to construct the output file, which is another overhead.
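
For reference, a minimal sketch of what the key+offset method looks like, assuming newline-terminated records with a fixed-width key at the start of each line. The sub name and parameters are hypothetical, and the array-of-pairs index is chosen for readability rather than for the smallest footprint.

```perl
use strict;
use warnings;

# Sort a file by a leading fixed-width key while holding only
# key+offset pairs in memory. $key_len is an assumption about the data.
sub sort_by_key_offset {
    my ($in_path, $out_path, $key_len) = @_;

    open my $in, '<', $in_path or die "Cannot open $in_path: $!";

    # Pass 1: remember each record's key and the byte offset where it starts.
    my @index;
    my $offset = tell($in);
    while (my $line = <$in>) {
        push @index, [ substr($line, 0, $key_len), $offset ];
        $offset = tell($in);
    }

    # Sort the index, not the records themselves.
    my @sorted = sort { $a->[0] cmp $b->[0] } @index;

    # Pass 2: re-read the input, seeking to each record in key order.
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    for my $entry (@sorted) {
        seek($in, $entry->[1], 0) or die "seek failed: $!";
        my $line = <$in>;
        $line .= "\n" unless $line =~ /\n\z/;   # final record may lack a newline
        print {$out} $line;
    }
    close $out or die "Cannot close $out_path: $!";
    close $in;

    return scalar @sorted;
}
```

The second pass is the overhead mentioned above: every record is fetched again from the input, and in key order rather than file order, so the reads are effectively random.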

it just uses a very naïve merge sort

It would be interesting to know what you mean by "naïve" in this context.

But I think that you may be missing the point: the main advantage of most system sort utilities is that they know how to use temporary spill files to avoid running out of memory, something that is impossible using Perl's internal sort.
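
To make the spill-file idea concrete, here is a rough sketch of the general technique, written by hand around Perl's sort: sort the input a chunk at a time, write each sorted chunk to a temporary file, then merge the files. It is not how any particular sort utility is implemented. The sub names and the `$max_lines` limit are assumptions for illustration, and plain string comparison of whole lines stands in for a real key comparison.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one sorted run to a temporary spill file; return a handle rewound to its start.
sub spill_run {
    my ($lines) = @_;
    my $tmp = tempfile();    # scalar context: handle only, file removed when closed
    print {$tmp} sort @$lines;
    seek($tmp, 0, 0) or die "seek failed: $!";
    return $tmp;
}

# Sort $in_path into $out_path, holding at most $max_lines lines in memory.
sub external_sort {
    my ($in_path, $out_path, $max_lines) = @_;

    # Phase 1: cut the input into sorted runs, one spill file per run.
    open my $in, '<', $in_path or die "Cannot open $in_path: $!";
    my (@runs, @chunk);
    while (my $line = <$in>) {
        $line .= "\n" unless $line =~ /\n\z/;   # final line may lack a newline
        push @chunk, $line;
        if (@chunk >= $max_lines) {
            push @runs, spill_run(\@chunk);
            @chunk = ();
        }
    }
    push @runs, spill_run(\@chunk) if @chunk;
    close $in;

    # Phase 2: merge the runs, keeping one pending line per run in memory.
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    my @head = map { scalar readline($_) } @runs;
    while (1) {
        my $min;
        for my $i (0 .. $#runs) {
            next unless defined $head[$i];
            $min = $i if !defined($min) || $head[$i] lt $head[$min];
        }
        last unless defined $min;               # every run is exhausted
        print {$out} $head[$min];
        $head[$min] = readline($runs[$min]);
    }
    close $out or die "Cannot close $out_path: $!";

    return scalar @runs;
}
```

Memory use is bounded by `$max_lines` during the first phase and by the number of runs during the merge. The linear scan for the smallest pending line is fine for a handful of runs; with many runs a heap would be the usual replacement.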


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^3: sorting very large text files by BrowserUk
in thread sorting very large text files by rnaeye
