Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Please note that when I wrote that I was more conversant with theory than practice. Since then I've gained some practical experience. (Including sorting a dataset that was too big to store on the disk in uncompressed form!)

The suggestion will work. It is just the wrong solution to use with a large dataset.

Let me explain the problem. Suppose we have, say, 100 million lines to sort. Lines average 70 bytes each. We're doing this on a hard drive which turns at 6000 rpm and which can sustain 100 MB/second. How long does the btree solution take?

Well what is disk seek time? 6000 rpm means 100 revolutions per second. When you choose to seek, the disk could be anywhere, so on average you have half a revolution. So average seek time is 1/200th of a second. We have (ballpark) 100 million writes to make, so 100 million seeks will take us about 500,000 seconds, which is about 5.8 days. (I've left out a number of complicating factors, but still it will take a matter of days to do it.)

Now let's consider sorting on disk using a merge sort. Merge sort is the process of repeatedly taking sorted lists and merging them into longer sorted lists. If you're clever about it (any good merge sort implementation will be), you can arrange to always be merging lists of about the same size, and you can arrange to not have very many lists lying around at once.

Our dataset is 7 GB (100 million lines times 70 bytes per line). With 100 million elements, a merge sort will finish in 27 passes. (There is a power of 2 pattern since with every pass the length of the lists double.) On each pass you have to read and write the data. So that is 7 GB * 27 * 2 = 378 GB. At 100 MB/s that takes 3780 seconds, or about an hour. (Again I've massively oversimplified, but that is about right.)

Sure, both solutions finish, but which solution would you prefer to run?

This is not to say that btrees don't have their place. They do. Particularly if you need to keep a changing dataset in sorted order. But if you want to sort a large dataset one time, they are not optimal. And, in fact, if you want to put a large dataset into a btree, you should first sort it with a mergesort then put it in the btree.


In reply to Re^2: Re (tilly) 1: Sorting data that don't fit in memory by tilly
in thread Sorting data that don't fit in memory by jeroenes

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chilling in the Monastery: (4)
    As of 2020-11-01 01:58 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      My favourite web site is:












      Results (291 votes). Check out past polls.

      Notices?