
If your file has 40e6 records and it takes an hour to sort, the utility is reading, sorting and writing about 11 records every millisecond, which is pretty good by anyone's standards given the IO involved.

However, the example records you've posted are 96 chars, so 40e6 records would amount to just under 4GB. If your 15GB files contain similar records, then they'll have more like 165 million records. And if they take 1 hour then the utility is reading-sorting-writing 45 records per millisecond.
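For anyone wanting to redo that arithmetic with their own numbers, here is a minimal sketch. The sub names are made up for illustration, and the record count you get for a 15GB file depends on assumptions the figures above don't pin down (whether a GB is 10**9 or 2**30 bytes, and whether the 96 chars include the line terminator), so treat the results as ballpark only.

```perl
use strict;
use warnings;

# Records handled per millisecond, given a record count and elapsed seconds.
sub records_per_ms {
    my ($records, $seconds) = @_;
    return $records / ($seconds * 1000);
}

# Rough record count for a file of $bytes bytes, assuming every record
# occupies $record_len bytes (line terminator included, if any).
sub estimated_records {
    my ($bytes, $record_len) = @_;
    return int($bytes / $record_len);
}

# 40e6 records in one hour
printf "%.1f records/ms\n", records_per_ms(40e6, 3600);

# 165e6 records in one hour
printf "%.1f records/ms\n", records_per_ms(165e6, 3600);

# Size of 40e6 records of 96 bytes each, in units of 10**9 bytes
printf "%.2f GB\n", 40e6 * 96 / 1e9;

# Assumption: 15GB taken as 15 * 2**30 bytes, 96 chars plus a newline
printf "%d records\n", estimated_records(15 * 2**30, 97);
```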

You simply aren't going to beat the highly optimised C code using Perl.

If you're doing this often enough to need to speed it up, then there are a few possibilities you could consider.

Use more drives.
If you have more than one (local) drive on the machine where this is happening, try to ensure that the output file is on a different drive (a physical drive, not just a partition) from the input file.
Also check the effect of using -T, --temporary-directory=DIR to use a different physical drive for temporaries.
Use faster drives.
If you're doing this often, it might well be worth spending £200/£300 on a Solid State Disk.
These are orders of magnitude faster than hard disks and could yield a substantial speedup if used correctly.
Use more processes.
If you have multi-core hardware, you might achieve some gains by preprocessing the file to split it into a few smaller sets, sorting those in concurrent processes, and concatenating the outputs.
Say you have 4 CPUs available, and your records are fairly evenly spread, ranging from 11_... to 99_...; then you could start (say) three child processes from your Perl script using piped opens, and feed them records as you read them: 11_... thru 39_... to the first; 40_... thru 69_... to the second; and 70_... onwards to the last. Because the three ranges don't overlap and are already in order, you then simply concatenate the output files from the three sorts to achieve the final goal.
Again, you'd need to experiment with the placement of the output files and number of processes to see what works best.
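
The three-way split described above might be sketched roughly like this. It is only a sketch and has not been tried on files of that size. It assumes every record starts with two digits and an underscore, that the 40/70 boundaries give a reasonably even split, and that a sort utility supporting -T and -o is on your PATH. The names bucket_for and partitioned_sort are invented here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: choose a bucket from the two-digit key prefix.
# Boundaries follow the 11-39 / 40-69 / 70+ split suggested above.
sub bucket_for {
    my ($line) = @_;
    my ($prefix) = $line =~ /^(\d\d)_/
        or die "Unexpected record format: $line";
    return $prefix < 40 ? 0 : $prefix < 70 ? 1 : 2;
}

# Read $in once, feed each record to one of three concurrent sort
# children via piped opens, then concatenate their outputs into $out.
# $workdir holds the per-bucket outputs and sort's temporaries; ideally
# it sits on a different physical drive from $in.
sub partitioned_sort {
    my ($in, $out, $workdir) = @_;

    my (@pipes, @parts);
    for my $i (0 .. 2) {
        my $part = "$workdir/part$i.sorted";
        # List-form piped open: no shell is involved, so no quoting worries.
        open my $fh, '|-', 'sort', '-T', $workdir, '-o', $part
            or die "Cannot start sort: $!";
        push @pipes, $fh;
        push @parts, $part;
    }

    open my $ifh, '<', $in or die "Cannot open $in: $!";
    while (my $line = <$ifh>) {
        print { $pipes[ bucket_for($line) ] } $line;
    }
    close $ifh;

    # close() on a piped handle waits for that child to finish.
    for my $i (0 .. 2) {
        close $pipes[$i] or die "sort child $i failed (status $?)";
    }

    # The buckets are disjoint and in key order, so concatenating the
    # sorted parts yields a fully sorted file.
    open my $ofh, '>', $out or die "Cannot open $out: $!";
    for my $part (@parts) {
        open my $pfh, '<', $part or die "Cannot open $part: $!";
        print {$ofh} $_ while <$pfh>;
        close $pfh;
        unlink $part;
    }
    close $ofh or die "Cannot close $out: $!";
}

# Usage: perl psort.pl INPUT OUTPUT WORKDIR
partitioned_sort(@ARGV) if @ARGV == 3;
```

Whether this beats a single sort will depend on your drives as much as your cores, so time it with the work directory and output file in different places before committing to it.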

Anyway, just a few thoughts for consideration.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"I'd rather go naked than blow up my ass"

In reply to Re: sorting very large text files by BrowserUk
in thread sorting very large text files by rnaeye


