PerlMonks

Re^2: sorting very large text files

by BrowserUk (Pope)
on Jan 06, 2010 at 05:12 UTC ( #815882 )


in reply to Re: sorting very large text files
in thread sorting very large text files

I am surprised that this Perl script is only half the speed of Gnu sort ... Most of the time seems to be spent in the output loop.

Your file consists of 5 million 78-byte lines. If you assume that the basic block size of your disk is 4k, then:

  • when reading sequentially, even if the file were totally fragmented and required a seek and/or disk rotation between each read, it requires only around 93,000 seeks & reads.

  • but when reading randomly, each block will need to be re-sought & read 52 times--because each 4k block contains 52 lines.

    So that's close to 5 million seeks & reads.
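The arithmetic behind those figures can be checked directly. A small sketch, using the block size and line length assumed above (the sequential figure lands in the same ballpark as the ~93,000 cited):

```perl
use strict; use warnings;

my $lines      = 5_000_000;  # lines in the file
my $line_len   = 78;         # bytes per line
my $block_size = 4096;       # assumed disk block size

my $lines_per_block = int( $block_size / $line_len );        # 52
my $seq_reads = int( $lines * $line_len / $block_size ) + 1; # sequential case

printf "lines per block:          %d\n",  $lines_per_block;
printf "sequential seeks & reads: ~%d\n", $seq_reads;
printf "random seeks & reads:     ~%d\n", $lines;  # worst case: one per line
```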

Of course, it won't be quite that bad, because your OS will attempt to keep blocks in its cache and thereby satisfy subsequent line reads from previously read blocks. But as the file gets bigger and its contents more disordered, the odds of retaining a block long enough for all 52 of its line requests to be satisfied from one read drop off rapidly.

This is further compounded by the fact that you are writing your records to the output file as you read them, forcing the disk to interleave seeks on the input file with seeks on the output file--unless your /opt directory is on another drive.

You could probably improve the performance of your code on single-disk systems by allocating a large buffer for the output, accumulating the records in it, and then writing it in one blast. If you try that, make sure that you overwrite the buffer rather than freeing and reallocating it.
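A minimal sketch of that single-buffer approach. The demo data, temp file, and flush threshold are illustrative, not from the original; the key point is that assigning `''` empties the string while Perl keeps the allocation, so the buffer is reused rather than reallocated:

```perl
use strict; use warnings;
use File::Temp qw(tempfile);

my @records = map { sprintf "%06d\n", $_ } reverse 1 .. 1000;  # demo data

my ( $out, $path ) = tempfile( UNLINK => 0 );

my $BUF_MAX = 1024 * 1024;  # flush threshold; tens of MB in real use
my $buf     = '';           # reused between flushes, never reallocated

for my $rec ( sort @records ) {
    $buf .= $rec;
    if ( length($buf) >= $BUF_MAX ) {
        print {$out} $buf;
        $buf = '';          # empties the string; Perl keeps the allocation
    }
}
print {$out} $buf if length $buf;  # final partial buffer
close $out or die $!;
```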

Better still would be to allocate two large buffers, accumulate in one, then switch to the other whilst the first is being written. It would require the use of threads or asynchronous IO to make best use of that.
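One way to sketch the two-buffer idea with Perl's ithreads: the main thread fills a buffer while a writer thread drains the previously filled one, with Thread::Queue doing the hand-off. Buffer size and record source here are illustrative assumptions; note that the enqueue copies the buffer between threads:

```perl
use strict; use warnings;
use threads;
use Thread::Queue;
use File::Temp qw(tempfile);

my ( $out, $path ) = tempfile( UNLINK => 0 );
my $q = Thread::Queue->new;

# Writer thread: drains filled buffers while the main thread refills.
my $writer = threads->create( sub {
    while ( defined( my $chunk = $q->dequeue ) ) {
        print {$out} $chunk;
    }
    close $out or die $!;
} );

my $BUF_MAX = 256 * 1024;   # illustrative; tune to your disk and RAM
my $buf     = '';
for my $rec ( map { "record $_\n" } 1 .. 50_000 ) {
    $buf .= $rec;
    if ( length($buf) >= $BUF_MAX ) {
        $q->enqueue($buf);  # hand the full buffer off and start refilling
        $buf = '';
    }
}
$q->enqueue($buf) if length $buf;
$q->enqueue(undef);         # end-of-stream marker
$writer->join;
```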

Ultimately, the fastest external sorts avoid random IO by reading sequential chunks of the file, sorting them in memory, and writing the results to temporary files. These are then merge sorted into the output file. Although the merge sort phase isn't strictly sequential because it requires seeks between the different temporary files, all the records from any one temporary file--and any given 4k block--will be accessed sequentially. That means the system only needs to cache one 4k block per temporary file to avoid having to re-seek & read any given block. A far more likely scenario than being able to cache all the active blocks in a huge file being read randomly.
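The chunk-sort-then-merge scheme might look like this in outline. It is a toy sketch: the merge finds the smallest head with a linear scan, where a real implementation would use a heap once the number of runs grows; file names come from File::Temp:

```perl
use strict; use warnings;
use File::Temp qw(tempfile);

# Phase 1: read fixed-size chunks, sort each in memory, write temp runs.
# Phase 2: k-way merge; only one line per run needs to be in memory.
sub external_sort {
    my ( $in_path, $out_path, $chunk_lines ) = @_;

    open my $in, '<', $in_path or die $!;
    my @runs;
    while ( !eof $in ) {
        my @chunk;
        while ( @chunk < $chunk_lines and defined( my $line = <$in> ) ) {
            push @chunk, $line;
        }
        my ( $fh, $name ) = tempfile( UNLINK => 1 );
        print {$fh} sort @chunk;
        close $fh or die $!;
        push @runs, $name;
    }
    close $in;

    my @fhs   = map { open my $fh, '<', $_ or die $!; $fh } @runs;
    my @heads = map { scalar <$_> } @fhs;

    open my $out, '>', $out_path or die $!;
    while ( grep defined, @heads ) {
        # Linear scan for the smallest head; use a heap when runs are many.
        my ($min) = sort { $heads[$a] cmp $heads[$b] }
                    grep { defined $heads[$_] } 0 .. $#heads;
        print {$out} $heads[$min];
        $heads[$min] = readline $fhs[$min];
    }
    close $out or die $!;
}
```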


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
