Re: sorting very large text files

by BrowserUk (Patriarch)
on Dec 19, 2009 at 08:07 UTC


in reply to sorting very large text files

If your file has 40e6 records and it takes an hour to sort, the utility is succeeding in reading-sorting-writing 11 records every millisecond. Which is pretty good by anyone's standards, given the IO involved.

However, the example records you've posted are 96 chars, so 40e6 records would amount to just under 4GB. If your 15GB files contain similar records, then they'll have more like 165 million records. And if they take 1 hour then the utility is reading-sorting-writing 45 records per millisecond.
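For reference, a quick back-of-the-envelope check of those throughput figures:

    # One hour is 3.6e6 milliseconds.
    my $ms_per_hour = 60 * 60 * 1000;
    printf "%.1f records/ms\n",  40e6 / $ms_per_hour;   # ~11 for the ~4GB file
    printf "%.1f records/ms\n", 165e6 / $ms_per_hour;   # ~46 for the 15GB file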

You simply aren't going to beat the highly optimised C code using Perl.

If you're doing this often enough to need to speed it up, then there are a few possibilities you could consider.

  • Use more drives.

    If you have more than one (local) drive on the machine where this is happening, try to ensure that the output file is on a different drive (physical; not partition) to the input file.

    Also check the effect of using -T, --temporary-directory=DIR to use a different physical drive for temporaries.

  • Use faster drives.

    If you're doing this often, it might well be worth spending £200/£300 on a Solid State Disk.

    These are orders of magnitude faster than hard disks and could yield a substantial speedup if used correctly.

  • Use more processes.

    If you have multi-core hardware, you might achieve some gains by preprocessing the file to split it into a few smaller sets, sorting those in concurrent processes, and concatenating the outputs.

    Say you have 4 CPUs available and your records are fairly evenly distributed, ranging from 11_... to 99_.... You could then start (say) three child processes from your Perl script using piped-opens, and feed them records as you read them: 11_... through 39_... to the first; 40_... through 69_... to the second; and 70_... onwards to the last. You then concatenate the output files from the three sorts to achieve the final goal (see the sketch after this list).

    Again, you'd need to experiment with the placement of the output files and number of processes to see what works best.
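A minimal sketch of that splitting approach (assuming three partitions keyed on the leading two-digit field, as in the example above; the /driveN/ paths for the children's -T temporaries and -o outputs are purely illustrative):

    #!/usr/bin/perl
    # Sketch only: split the input across three concurrent system sorts,
    # keyed on the leading two-digit field, then concatenate the outputs.
    use strict;
    use warnings;

    my @range = ( [ 11, 39 ], [ 40, 69 ], [ 70, 99 ] );
    my @kid;

    for my $i ( 0 .. $#range ) {
        # List-form piped open: each child is an independent sort process.
        # Put the -T temp dirs and -o outputs on separate physical drives.
        open my $fh, '|-', 'sort', '-T', "/drive$i/tmp",
                                   '-o', "/drive$i/part$i.sorted"
            or die "spawn sort: $!";
        $kid[$i] = $fh;
    }

    while ( my $line = <STDIN> ) {
        my $n = substr $line, 0, 2;           # the leading 11_ .. 99_ field
        my ($i) = grep { $n >= $range[$_][0] and $n <= $range[$_][1] }
                  0 .. $#range;
        print { $kid[$i] } $line;
    }

    # Closing a piped-open waits for the child and reports its exit status.
    close $_ or die "sort failed: $?" for @kid;

    # Finally, concatenate the sorted partitions in key order, e.g.
    #   cat /drive0/part0.sorted /drive1/part1.sorted /drive2/part2.sorted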

Anyway, just a few thoughts for consideration.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: sorting very large text files
by salva (Canon) on Dec 19, 2009 at 09:37 UTC
    You simply aren't going to beat the highly optimised C code using Perl.
    Out of curiosity, I have been looking at the supposedly "highly optimised C code" of GNU sort, and the reality is that it is not so highly optimised: it just uses a very naïve merge sort. In comparison, Perl's internal sortsv function is much more advanced...
      Perl's internal sortsv function is much more advanced...

      Possibly so, but you're still comparing C code to C code. I said Perl -v- C.

      The main problem with Perl's sort is that you have to build an array (or list, but that's worse) first. For a 40 million record file that is going to require 3.6GB(*), and that's more than any 32-bit OS can handle. And on the basis of the OP's example records, that is only a 4GB file.

      So for the 15GB ~165 million record file you're talking about a memory requirement--just to hold the data--of ~16GB(*) (assuming a 13-byte key and 64-bit offset). Which is beyond the realms of most current 64-bit commodity hardware.

      (*) Minimum; much more if you store the entire record (even compressed--when you'd also have to factor in the compression and decompression steps), rather than just keys and offsets. And if you go for the RAM-minimising key+offset method (sketched below), then you have to re-read the entire input file to construct the output file. Another overhead.
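      A minimal sketch of that key+offset approach, assuming (as above) a 13-byte leading key; the input and output paths are placeholders:

          # Pass 1: remember just the key and the byte offset of every record.
          use strict;
          use warnings;

          open my $IN, '<', 'input.dat' or die $!;

          my @index;                                # [ key, offset ] pairs
          while ( 1 ) {
              my $pos  = tell $IN;
              my $line = <$IN>;
              last unless defined $line;
              push @index, [ substr( $line, 0, 13 ), $pos ];
          }

          # Pass 2: sort the index in RAM, then re-read the input to emit
          # the records in key order (the re-read overhead noted above).
          open my $OUT, '>', 'output.dat' or die $!;
          for my $rec ( sort { $a->[0] cmp $b->[0] } @index ) {
              seek $IN, $rec->[1], 0;
              print {$OUT} scalar <$IN>;
          }

      Even this index costs Perl on the order of a hundred bytes per record once scalar and array overheads are counted, which is roughly where the ~16GB minimum above comes from.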

      it just uses a very naïve merge sort

      It would be interesting to know what you mean by naive in this context?

      But I think that you may be missing the point: the main advantage of most system sort utilities is that they know how to use temporary spill files to avoid running out of memory. Something that is impossible using Perl's internal sort.
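      To make the spill-file idea concrete, here is a sketch of the general technique in pure Perl. It is not GNU sort's actual implementation; the chunk size is an arbitrary choice and the merge is a simple linear scan of the run heads:

          use strict;
          use warnings;
          use File::Temp qw( tempfile );

          my $CHUNK = 1_000_000;              # records held in RAM per run
          my ( @runs, @buf );

          # Phase 1: sort fixed-size chunks in memory, spill each to a temp file.
          while ( my $line = <STDIN> ) {
              push @buf, $line;
              if ( @buf >= $CHUNK ) { push @runs, spill( \@buf ); @buf = () }
          }
          push @runs, spill( \@buf ) if @buf;

          # Phase 2: merge the sorted runs, always emitting the smallest head.
          my @fh   = map { open my $h, '<', $_ or die $!; $h } @runs;
          my @head = map { scalar readline( $_ ) } @fh;
          while ( grep { defined } @head ) {
              my $min;
              for my $i ( 0 .. $#head ) {
                  next unless defined $head[$i];
                  $min = $i if !defined $min or $head[$i] lt $head[$min];
              }
              print $head[$min];
              $head[$min] = readline( $fh[$min] );
          }

          sub spill {
              my ( $fh, $name ) = tempfile();
              print {$fh} sort @{ $_[0] };
              close $fh or die $!;
              return $name;
          }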


        It would be interesting to know what you mean by naive in this context?

        I mean the algorithm as you would learn it from the usual "introduction to algorithms and data structures" course, without taking into consideration practical issues such as hierarchical memory with several levels of cache, or the fact that real data is frequently not completely random.

        I think that you may be missing the point

        Well, actually I didn't want to make any point beyond expressing my surprise at the suboptimal algorithm used in that particular implementation of the sort command.

        I had always taken for granted that nothing could be faster than the external command (at least for big data sets, where the required set-up overhead gets diluted), but now I would also check other approaches, for instance the excellent Sort::External.
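        For instance, a sketch with Sort::External (the feed/finish/fetch calls and the mem_threshold parameter are from memory of the module's documented interface, so check the current docs; the paths and the 64MB threshold are arbitrary):

            use strict;
            use warnings;
            use Sort::External;

            # Buffer records in RAM and spill sorted runs to disk past ~64MB.
            my $sortex = Sort::External->new( mem_threshold => 64 * 2**20 );

            open my $IN, '<', 'input.dat' or die $!;
            while ( my $line = <$IN> ) {
                $sortex->feed( $line );
            }
            $sortex->finish;

            open my $OUT, '>', 'output.dat' or die $!;
            while ( defined( my $line = $sortex->fetch ) ) {
                print {$OUT} $line;
            }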

Re^2: sorting very large text files
by Anonymous Monk on Dec 20, 2009 at 04:29 UTC

    Use faster drives. If you're doing this often, it might well be worth spending £200/£300 on a Solid State Disk. These are orders of magnitude faster than hard disks and could yield a substantial speedup if used correctly.

    I have a vague impression that SSD technology is not yet mature: it will break after being read 200,000 times. If I remember right, an SSD is not a good choice for him.

      I have a vague impression that SSD technology is not yet mature: it will break after being read 200,000 times.

      Things have moved on. A lot!

      Otherwise, it would be very bad news for the owners of the dozens (hundreds?) of models of netbook (like this) that come with the OS installed on the only drive they have, which is an SSD.

      If you look at the spec sheet for the SSD I linked, the significant number is "Write Endurance (500GB/Day): >10". That's more than 10 years!

      And if I remember correctly, the GoogleOS will only run from SSDs. Though that's possibly not a recommendation.


      No. SSDs have a *write* limit; you can read all you want. And the limit is per block, so how many times you can write to a disk depends on the disk's size and the file's size. For instance, if your disk can handle 10,000 writes per block, then you could rewrite the entire disk that many times; if you're only using a quarter of the disk, you could rewrite your data 40,000 times, and so on.
