Re^2: sorting very large text files

by salva (Canon)
on Dec 19, 2009 at 09:37 UTC


in reply to Re: sorting very large text files
in thread sorting very large text files

You simply aren't going to beat the highly optimised C code using Perl.
Out of curiosity, I have been looking at the supposedly "highly optimised C code" of GNU sort, and the reality is that it is not so highly optimised: it just uses a very naïve merge sort. In comparison, Perl's internal sortsv function is much more advanced...
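
For anyone who wants to check, here is a rough benchmark sketch (the record count, key format, and file names are made up for illustration, and the external run also pays for file I/O, so take the numbers as indicative only):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # Keep GNU sort doing plain byte comparisons, like Perl does;
    # locale-aware collation can slow it down enormously.
    $ENV{LC_ALL} = 'C';

    my $n = 1_000_000;
    my @lines = map { sprintf "%013d\n", int rand 1e13 } 1 .. $n;

    my $t0 = time;
    my @sorted = sort @lines;
    printf "perl sort: %.2fs\n", time - $t0;

    open my $fh, '>', 'bench.tmp' or die $!;
    print {$fh} @lines;
    close $fh;

    $t0 = time;
    system('sort', '-o', 'bench.sorted.tmp', 'bench.tmp') == 0
        or die "external sort failed";
    printf "GNU sort:  %.2fs\n", time - $t0;
    unlink 'bench.tmp', 'bench.sorted.tmp';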

Re^3: sorting very large text files
by BrowserUk (Patriarch) on Dec 19, 2009 at 11:00 UTC
    Perl's internal sortsv function is much more advanced...

    Possibly so, but you're still comparing C code to C code. I said Perl -v- C.

    The main problem with Perl's sort is that you have to build an array (or a list, but that's worse) first. For a 40 million record file that is going to require 3.6GB(*), and that's beyond any 32-bit OS to handle. And on the basis of the OP's example records, that is only a 4GB file.

    So for the 15GB, ~165 million record file you're talking about a memory requirement--just to hold the data--of ~16GB(*) (assuming a 13-byte key and a 64-bit offset), which is beyond the reach of most current 64-bit commodity hardware.

    (*)Minimum; much more if you store the entire record (even compressed, when you'd also have to factor in the compression and decompression steps) rather than just keys and offsets. And if you go for the RAM-minimising key+offset method, as sketched below, then you have to re-read the entire input file to construct the output file. Another overhead.
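
    For clarity, the key+offset method looks something like this minimal sketch (it assumes the first 13 bytes of each line are the key, a 64-bit Perl for the Q pack template, and made-up file names):

        use strict;
        use warnings;

        open my $in, '<', 'big.dat' or die $!;

        # Build one packed "key + offset" string per record.
        my @index;
        my $off = tell $in;
        while (my $line = <$in>) {
            push @index, pack 'a13 Q', substr($line, 0, 13), $off;
            $off = tell $in;
        }

        # A lexicographic sort orders by the 13-byte key first; the
        # trailing native-endian offset bytes only break ties arbitrarily.
        @index = sort @index;

        # Re-read the input to construct the output: the extra pass.
        open my $out, '>', 'big.sorted' or die $!;
        for my $rec (@index) {
            my $offset = unpack 'x13 Q', $rec;
            seek $in, $offset, 0 or die $!;
            print {$out} scalar <$in>;
        }

    Even then, each packed string carries Perl's per-scalar overhead on top of the 21 raw bytes, which is where estimates well beyond the raw data size come from.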

    it just uses a very naïve merge sort

    It would be interesting to know what you mean by naive in this context.

    But I think you may be missing the point: the main advantage of most system sort utilities is that they know how to use temporary spill files to avoid running out of memory. Something that is impossible using Perl's internal sort.
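
    To illustrate the spill-file strategy, here is a minimal sketch (the chunk size is arbitrary, and the linear-scan merge should really be a heap once the number of runs grows):

        use strict;
        use warnings;
        use File::Temp qw(tempfile);

        sub external_sort {
            my ($in_path, $out_path, $chunk_lines) = @_;
            open my $in, '<', $in_path or die $!;

            # Phase 1: sort fixed-size chunks in memory, spilling each
            # sorted run to its own temporary file.
            my @runs;
            until (eof $in) {
                my @chunk;
                while (@chunk < $chunk_lines) {
                    my $line = <$in>;
                    last unless defined $line;
                    push @chunk, $line;
                }
                my ($fh, $name) = tempfile(UNLINK => 1);
                print {$fh} sort @chunk;
                close $fh;
                push @runs, $name;
            }

            # Phase 2: k-way merge of the runs, always emitting the
            # smallest pending line.
            my @fhs   = map { open my $fh, '<', $_ or die $!; $fh } @runs;
            my @heads = map { scalar <$_> } @fhs;
            open my $out, '>', $out_path or die $!;
            while (grep defined, @heads) {
                my $min;
                for my $i (0 .. $#heads) {
                    next unless defined $heads[$i];
                    $min = $i if !defined $min or $heads[$i] lt $heads[$min];
                }
                print {$out} $heads[$min];
                $heads[$min] = readline $fhs[$min];
            }
        }

        external_sort('big.dat', 'big.sorted', 500_000);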


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      It would be interesting to know what you mean by naive in this context.

      I mean the algorithm as you would learn it from the usual "introduction to algorithms and data structures" course, without taking into consideration practical issues such as hierarchical memory with several levels of cache, or the fact that real data is frequently not completely random.
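
      To make that concrete, this is roughly the textbook algorithm I mean; correct and O(n log n), but oblivious to caches and to any presorted runs already in the data:

          use strict;
          use warnings;

          # Top-down merge sort, straight from the course notes.
          sub merge_sort {
              my @a = @_;
              return @a if @a <= 1;
              my $mid   = int(@a / 2);
              my @left  = merge_sort(@a[0 .. $mid - 1]);
              my @right = merge_sort(@a[$mid .. $#a]);
              my @merged;
              while (@left and @right) {
                  # Taking from the left on ties keeps the sort stable.
                  push @merged, $left[0] le $right[0] ? shift @left : shift @right;
              }
              return (@merged, @left, @right);
          }

      As far as I can tell, Perl's sortsv is, by contrast, an adaptive merge sort that detects and exploits runs that are already ordered, which is exactly the kind of refinement this sketch lacks.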

      I think you may be missing the point

      Well, actually I didn't want to make any point beyond expressing my surprise at the suboptimal algorithm used in that particular implementation of the sort command.

      I had always taken for granted that nothing could be faster than the external command (at least for big data sets, where the required set-up overhead gets diluted), but now I would also check other approaches, for instance the excellent Sort::External.
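
      For reference, its feed/finish/fetch interface looks like this (the 16MB memory threshold and the file name are arbitrary choices for illustration):

          use strict;
          use warnings;
          use Sort::External;

          my $sortex = Sort::External->new( mem_threshold => 1024 ** 2 * 16 );

          open my $fh, '<', 'big.dat' or die $!;
          $sortex->feed($_) while <$fh>;
          close $fh;

          $sortex->finish;    # sorts and merges the temporary runs
          while ( defined( my $line = $sortex->fetch ) ) {
              print $line;
          }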
