Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
I am surprised that this Perl script is only half the speed of Gnu sort ... Most of the time seems to be being spent in the output loop.

Your file consists of 5 million 78-byte lines. If you assume that the basic block size of your disk is 4k, then:

  • when reading sequentially, even if the file was totally fragmented and required a seek and/or disk rotation between each read, then it requires 93,000 seeks & reads.

  • but when reading randomly, each block will need to be re-sought & read 52 times--because each 4k block contains 52 lines.

    So that's close to 5 million seeks & reads.

Of course, it won't be quite that bad because your OS will attempt to keep blocks read in it's cache and thereby fulfill subsequent line read requests from previously read blocks from cache. But as the file gets bigger and the contents more disordered, then the odds of it being able to retain blocks long enough for all 52 line requests to be satisfied from 1 read drop off rapidly.

This is further compounded by the fact that you are writing your records to the output file as you read them, forcing the disk to interleave seeks on the input file with seeks on the output file--unless your /opt directory is on another drive.

You could probably improve the performance of your code on single disk systems by allocating a large buffer for the output, accumulate the records in that and then write it in one blast. If you try that, make sure that you overwrite the buffer not free and reallocate it.

Better still would be to allocate two large buffers, accumulate in one, then switch to the other whilst the first is being written. It would require the use of threads or asynchronous IO to make best use of that.

Ultimately, the fastest external sorts avoid random IO by reading sequential chunks of the file, sorting them in memory, and writing the results to temporary files. These are then merge sorted into the output file. Although the merge sort phase isn't strictly sequential because it requires seeks between the different temporary files, all the records from any one temporary file--and any given 4k block--will be accessed sequentially. That means the system only needs to cache one 4k block per temporary file to avoid having to re-seek & read any given block. A far more likely scenario than being able to cache all the active blocks in a huge file being read randomly.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^2: sorting very large text files by BrowserUk
in thread sorting very large text files by rnaeye

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others making s'mores by the fire in the courtyard of the Monastery: (3)
    As of 2021-03-04 06:44 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      My favorite kind of desktop background is:











      Results (98 votes). Check out past polls.

      Notices?