http://qs321.pair.com?node_id=813469

rnaeye has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I have very large text files (10GB to 15GB). Each one contains about 40 million lines and 7 fields (see example below).

File looks like this:

    99_999_852_F3 chr9 97768833 97768867 ATTTTCTTCAATTACATTTCCAATGCTATCCCAAA + 35
    99_999_852_F3 chr9 97885645 97885679 ATTTTCTTCAaTTACATTTCCAATGCTATCCCAAA + 35
    99_99_994_F3 chr10 47028821 47028855 AGACAAAAAGGCCATCAACAGATCAGTAAAGGATC + 35
    ...

I need to sort the files based on field-1 (ASCII sorting). I am using the Unix sort -k1 command. Although it works fine, it takes a very long time: 30 minutes to an hour. I also tried the following Perl script:

    #!/usr/bin/perl
    use strict;
    use warnings;

    open (INFILE, "inputfile.txt") or die $!;
    open (OUTFILE, '>', "sorted.txt") or die $!;
    foreach (sort <INFILE>){
        print OUTFILE $_;
    }
    close(OUTFILE);
    close(INFILE);
    exit;

However, this script loads the entire file into memory and the sorting becomes too slow. I was wondering if someone could suggest a Perl script that will sort faster than the Unix sort -k1 command and will not use too much memory. Thanks.

Replies are listed 'Best First'.
Re: sorting very large text files
by almut (Canon) on Dec 19, 2009 at 01:27 UTC

    Stick with the system sort. It does what needs to be done to not use too much memory, and this is why it's slow.  The system sort already is optimised C code — don't expect a functionally equivalent Perl solution to be faster for this particular task.

Re: sorting very large text files
by BrowserUk (Patriarch) on Dec 19, 2009 at 08:07 UTC

    If your file has 40e6 records and it takes an hour to sort, the utility is reading-sorting-writing 11 records every millisecond. Which is pretty good by anyone's standards, given the IO involved.

    However, the example records you've posted are 96 chars, so 40e6 records would amount to just under 4GB. If your 15GB files contain similar records, then they'll have more like 165 million records. And if they take 1 hour then the utility is reading-sorting-writing 45 records per millisecond.

    You simply aren't going to beat the highly optimised C code using Perl.

    If you're doing this often enough to need to speed it up, then there are a few possibilities you could consider.

    • Use more drives.

      If you have more than one (local) drive on the machine where this is happening, try to ensure that the output file is on a different drive (a different physical drive, not just a different partition) from the input file.

      Also check the effect of using -T, --temporary-directory=DIR to put the temporary files on a different physical drive.

    • Use faster drives.

      If you're doing this often, it might well be worth spending £200/£300 on a Solid State Disk.

      These are orders of magnitude faster than hard disks and could yield a substantial speedup if used correctly.

    • Use more processes.

      If you have multi-core hardware, you might achieve some gains by preprocessing the file to split it into a few smaller sets, sorting those in concurrent processes, and concatenating the outputs.

      Say you have 4 CPUs available and your records are fairly evenly distributed, ranging from 11_... to 99_.... You could start (say) three child processes from your Perl script using piped opens, and then feed them records as you read them: 11_... through 39_... to the first; 40_... through 69_... to the second; and 70_... onwards to the last. You then concatenate the output files from the three sorts to achieve the final goal (a rough sketch follows this list).

      Again, you'd need to experiment with the placement of the output files and number of processes to see what works best.
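
      A rough, untested sketch of that idea (the key boundaries, file names, and the external sort invocation are illustrative assumptions, not a tested recipe):

      #!/usr/bin/perl
      # Hypothetical sketch: feed records to concurrent sort(1) children via
      # piped opens, one child per key range, then concatenate the range outputs.
      use strict;
      use warnings;

      # Key boundaries and output names are made up for illustration.
      my @ranges = (
          { max => '40',  out => 'part1.sorted' },   # 11_... thru 39_...
          { max => '70',  out => 'part2.sorted' },   # 40_... thru 69_...
          { max => undef, out => 'part3.sorted' },   # 70_... onwards
      );

      for my $r (@ranges) {
          open $r->{fh}, '|-', "sort -k1,1 -o $r->{out}"
              or die "cannot start sort for $r->{out}: $!";
      }

      open my $in, '<', 'inputfile.txt' or die $!;
      while (my $line = <$in>) {
          my ($key) = $line =~ /^(\S+)/;
          for my $r (@ranges) {
              if (!defined $r->{max} or $key lt $r->{max}) {
                  print { $r->{fh} } $line;
                  last;
              }
          }
      }
      close $in;

      # Closing a piped filehandle waits for that child sort to finish.
      for my $r (@ranges) {
          close $r->{fh} or die "sort child for $r->{out} failed";
      }

      # The ranges are disjoint and ordered, so plain concatenation is sorted.
      system("cat part1.sorted part2.sorted part3.sorted > sorted.txt") == 0
          or die "concatenation failed";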

    Anyway, just a few thoughts for consideration.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      You simply aren't going to beat the highly optimised C code using Perl.
      Out of curiosity, I have been looking at the supposedly "highly optimised C code" of GNU sort, and the reality is that it is not so highly optimised: it just uses a very naïve merge sort. In comparison, Perl's internal sortsv function is much more advanced...
        Perl's internal sortsv function is much more advanced...

        Possibly so, but still you're comparing C-code to C-code. I said Perl -v- C.

        The main problem with Perl's sort is that you have to build an array (or a list, but that's worse) first. For a 40 million record file that is going to require 3.6GB(*), which is beyond what any 32-bit OS can handle. And on the basis of the OP's example records, that is only a 4GB file.

        So for the 15GB, ~165 million record file you're talking about a memory requirement--just to hold the data--of ~16GB(*) (assuming a 13-byte key and a 64-bit offset). Which is beyond the realms of most current 64-bit commodity hardware.

        (*)Minimum; much more if you store the entire record (even compressed--when you'd also have to factor in the compression and decompression steps), rather than just keys and offsets. And if you go for the RAM-minimising key+offset method, then you have to re-read the entire input file to construct the output file. Another overhead.

        it just uses a very naïve merge sort

        It would be interesting to know what you mean by naïve in this context.

        But I think that you may be missing the point: the main advantage of most system sort utilities is that they know how to use temporary spill files to avoid running out of memory. Something that is impossible using Perl's internal sort.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      Use faster drives. If you're doing this often, it might well be worth spending £200/£300 on a Solid State Disk. These are orders of magnitude faster than hard disks and could yield a substantial speedup if used correctly.

      I have a vague impression that SSD technology is not mature yet; it will break after 200,000 reads. If I remember right, an SSD is not a good choice for him.

        I have a vague impression that SSD technology is not mature yet; it will break after 200,000 reads.

        Things have moved on. A lot!

        Otherwise, it would be very bad news for the owners of the dozens (hundreds?) of models of netbook (like this) that come with the OS installed on their only drive, which is an SSD.

        If you look at the spec sheet for the SSD I linked, the significant number is Write Endurance (500GB/Day):    >10 (That's > 10 years!).

        And if I remember correctly, the GoogleOS will only run from SSDs. Though that's possibly not a recommendation.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        No. SSDs have a *write* limit; you can read all you want. And there's a per-block limit, so how many times you can write to a disk depends on the disk's size and the file's size. For instance, if your disk can handle 10,000 writes per block, then you could rewrite the entire disk that many times; and if your data only occupies a quarter of the disk, wear levelling means you could rewrite it roughly 40,000 times, and so on.
Re: sorting very large text files
by salva (Canon) on Dec 19, 2009 at 09:07 UTC
    GNU sort uses the following rule to determine the size of the memory buffer used for the mergesort algorithm:
    buffer_size = min(1/8 * physical_RAM, free_memory)
    That's somewhat conservative, especially if the machine you are using is not heavily loaded. Increasing that buffer size will make the sorting much faster, for instance:
    sort -S 3.5G ...
    Another way to increase the speed of the operation is to reduce the size of the file by using a more compact encoding. For instance, representing numbers in binary format instead of as ASCII strings will reduce their size to about 1/5; DNA sequences can be packed to 1/4 of their size; enumerations to 1 or 2 bytes; etc.

    This kind of compacting will introduce "\n" characters into the stream, and these need to be escaped. A simple way is to perform the following expansion (a fuller sketch follows below):

    my %expand = ( "\x0A" => "\x11\x11", "\x11" => "\x11\x12" );
    s/([\x0A\x11])/$expand{$1}/g;
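
    As a rough, untested illustration of that idea (the field layout, the pack formats, and the \x11 escape byte are assumptions for this sketch, not a drop-in recipe), each record could be re-encoded along these lines before being fed to sort:

    #!/usr/bin/perl
    # Hypothetical sketch of the compaction idea: keep field 1 (the sort key)
    # as-is, pack the numeric fields into binary, and escape any byte that
    # collides with the newline record separator (or with the escape byte).
    use strict;
    use warnings;

    my %expand = ( "\x0A" => "\x11\x11", "\x11" => "\x11\x12" );

    while (my $line = <STDIN>) {
        chomp $line;
        my ($id, $chr, $start, $end, $seq, $strand, $qual) = split /\s+/, $line;

        # 'N N' = two 32-bit big-endian integers, 'C' = one byte.
        my $compact = join "\t", $id, $chr,
                                 pack('N N', $start, $end),
                                 $seq, $strand,
                                 pack('C', $qual);

        # Escape embedded newlines so one record stays on one physical line.
        $compact =~ s/([\x0A\x11])/$expand{$1}/g;
        print "$compact\n";
    }
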
Re: sorting very large text files
by Khen1950fx (Canon) on Dec 19, 2009 at 02:59 UTC
    I think you'll want to read A Fresh Look at Efficient Perl Sorting by Uri Guttman and Larry Rosler. It's the best paper out there for a complete explanation of sorting, especially sorting in Perl. There's also a companion module you'll want to look at: Sort::Maker.
Re: sorting very large text files
by ikegami (Patriarch) on Dec 21, 2009 at 14:43 UTC
    How often do you need to sort these 15GB? Is taking an hour really a problem?
Re: sorting very large text files
by Ea (Chaplain) on Dec 21, 2009 at 09:47 UTC

    As salva just said, I'd be tempted to change the way you access the data. Naively (because this is not my bread and butter), I'd either create an index file of field -> file+line number and sort on that, or pre-sort the records into several smaller files so that there's less work to do.

    Not that these are particularly brilliant ideas; I'm just trying to illustrate that there's more than one way to crack this nut, and that it's worth looking in different directions.

    best of luck,

    perl -e 'print qq(Just another Perl Hacker\n)' # where's the irony switch?
      I'd either create an index file of field -> file+line number and sort on that

      Unfortunately it is not as easy as that. If you sort an index containing just the sorting keys and the offsets, you still need a final step in which you combine the sorted index with the original file to create the full sorted output file.

      Doing this step naively, just following the index and seeking into the original file to read every line, would be very, very, very inefficient. Roughly (using BrowserUk's estimate): 165e6 records * 10ms per seek = 19 days!!!

      A workaround is to create the final file in several passes, reading the original file sequentially each time and generating one slice of the output per pass... not so easy! (A rough sketch of the idea follows.)
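
      For the curious, a minimal, untested sketch of that multi-pass idea (the file names, slice size, and key extraction are assumptions for illustration; it also assumes the key+offset index itself fits in memory, which is exactly the constraint BrowserUk discusses above):

      #!/usr/bin/perl
      # Hypothetical sketch of the multi-pass "index sort" idea: sort a key+offset
      # index in memory, then make one sequential pass over the input per output
      # slice, so the big file is never read with per-record seeks.
      use strict;
      use warnings;

      my $infile    = 'inputfile.txt';   # assumed file names
      my $outfile   = 'sorted.txt';
      my $per_slice = 5_000_000;         # records buffered per pass; tune to RAM

      # Build and sort the key+offset index (assumes the keys fit in memory).
      my @index;
      open my $in, '<', $infile or die $!;
      my $offset = 0;
      while (my $line = <$in>) {
          my ($key) = $line =~ /^(\S+)/;
          push @index, [ $key, $offset ];
          $offset = tell $in;
      }
      close $in;
      @index = sort { $a->[0] cmp $b->[0] } @index;

      # One sequential pass per slice of the sorted index.
      open my $out, '>', $outfile or die $!;
      for (my $s = 0; $s < @index; $s += $per_slice) {
          my $e = $s + $per_slice - 1;
          $e = $#index if $e > $#index;

          # map wanted offsets to their rank within this slice
          my %rank;
          $rank{ $index[$_][1] } = $_ - $s for $s .. $e;

          my @slice;
          open my $pass, '<', $infile or die $!;
          my $pos = 0;
          while (my $line = <$pass>) {
              $slice[ $rank{$pos} ] = $line if exists $rank{$pos};
              $pos = tell $pass;
          }
          close $pass;
          print {$out} @slice;
      }
      close $out;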

Re: sorting very large text files
by talexb (Chancellor) on Dec 21, 2009 at 18:58 UTC

    How about splitting the large file into smaller pieces, sorting the individual pieces, then merging those pieces into a single sorted file?

    The disadvantage is that you'd need at least double the original file size in free space, but you might find that the total CPU time would be less, since merging would be fairly fast.

    There are lots of cool ways to do it:

    • Split the original file into 2/4/8 pieces based on file size;
    • Split the original file into 2/4/8 pieces by doing a round-robin choice on each line;
    • Sort into individual files based on the first 1/2/3 characters in each line;
    ... and there are lots more possibilities ... it would be a fun project.

    Alex / talexb / Toronto

    Team website: Forex Chart Monkey, Forex Technical Analysis and Pickpocket Prevention

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: sorting very large text files
by MadraghRua (Vicar) on Dec 21, 2009 at 21:11 UTC
    hello rnaeye (nice pun)

    Check out the SAMtools project - I think it's over on SourceForge. Depending upon the aligner you are using, there are either tools to munge the data into a SAM or BAM file, or your aligner may already produce one. SAM is the generic alignment format for short reads; BAM is the binary equivalent. Once you have the file in SAM or BAM format, you can use SAMtools to sort and index the short reads. From there you can use the indexes to pull out chromosome-specific reads, etc.

    You'll then need something like a BAM->BED converter to use downstream tools or to display the data in the genome browser of your choice. I typically find that converting a SAM or read file into BAM reduces it to about 10% of the text size. Of course, you then need to be able to read the BAM file, and that is where downstream tools like Picard, Bio-SAMTools (Perl), Pysam and cl-sam come in.

    If you want to roll your own, follow the good monks' advice: split the file, parse the aligned reads into separate per-chromosome files, and then sort the reads within each file. You can concatenate it all back together at the end.

    Good luck!

    MadraghRua
    yet another biologist hacking perl....

Re: sorting very large text files
by hda (Chaplain) on Dec 21, 2009 at 14:52 UTC
    Isn't this the kind of job where a database system like PostgreSQL, MySQL, and the like will perform better?
    Just wondering.

      No! Just importing the data into the RDBMS will take longer than your local system sort utility takes to complete.

      And after that, you've still got to index the data, perform the sorting query and output the data sorted.

      Allow for anything from 5 to 10 times as long to achieve the final goal of sorted data in a single file on disk. Maybe more.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk >>[...] from 5 to 10 times as long ... Maybe more.

        Sort will obviously always be faster than using a database (here, PostgreSQL) to load and query for the same result, but I want to show that you exaggerate the overhead of loading + querying into an output file. Indexing would be redundant, but I bet that even with an index it wouldn't take that much time.

        I constructed a similar file and sorted it by the dna column in the two ways under consideration (sort versus database slurp + query).

        Result: System sort: 59 minutes. Postgres load (6m) + select (59m): 66 minutes
        $ ls -lh dna2.txt
        -rw-rw-r-- 1 xxxxx xxxxx 12G Dec 21 16:38 dna2.txt

        1588 mm_ref_chrY.fa 2902494 R CTTGACTTTTATGTGTCACTGTGTTCAGTTCCTGGT
        939 mm_ref_chrY.fa 2902495 F CCAGGAACTGACCACAGTGACACATAAGAGTCAAAT
        742 mm_ref_chrY.fa 2902497 F AGGAACTGACCACAGTGACACATAAGAGTCAAATAG
        1097 mm_ref_chrY.fa 2902497 F AGGAACTGACCACAGTGACACATAAGAGTCAAATAG
        100 mm_ref_chrY.fa 2902499 F GAACTGACCACAGTGACACATAAAAGTCAAGTAGTG
        1184 mm_ref_chrY.fa 2902499 F GAACTGACCACAGTGACACATAAAAGTCAAGTAGTG
        286 mm_ref_chrY.fa 2902505 F ACCACAGTGCCACATAAAAGTCAAGTAGGGAATCCT
        235 mm_ref_chrY.fa 2902513 R ACCAGGACAGGATCCCCTACTTGACTTTTATGTGGC
        1744 mm_ref_chrY.fa 2902516 F ACATAAAAGTCAAGTAGTGGACCCTGTCCTGGTCTG
        1029 mm_ref_chrY.fa 2902519 F TAAAAGTCAAGTAGGGGATCCTGTCCTGGTCTGGCA

        #
        # bash + sort version:
        #
        $ time sort -k5 dna2.txt > dna2.sortk5.out

        real    59m48.641s

        #
        # Postgres version:
        #
        # loading the data:
        #
        $ time < dna2.txt \
            psql -d test -c "
                drop table if exists dna_test;
                create table dna_test(
                    y                integer,
                    chromosome       text,
                    genomic_location integer,
                    direction        text,
                    seq              text);
                copy dna_test from stdin csv delimiter E'\t';
            "

        real    6m20.430s

        $ echo "
            select to_char(count(*), '999G999G999') as rowcount from dna_test" | psql -d test

           rowcount
        --------------
         181,261,572
        (1 row)

        #
        # querying a resultset into a sorted file:
        #
        $ time echo "
            copy (select * from dna_test order by seq) to stdout" \
            | psql -1qtAd test > dna_test.order_by_seq.out

        real    59m12.569s

        So:

             unix sort: real    59m48.641s
        table ORDER BY: real    59m12.569s  
        the latter preceded by   6m20.430s  overhead for loading table data
        
        

        So much for your (BrowserUK's) guess: "database takes 5 to 10 times as long as system sort ... Maybe more"...

        It almost amounts to slander :)

        Update 1

        Of course, I couldn't resist trying with an index as well, and sure enough it's useless / not used:

        $ time echo "
            create index dna_test_seq_idx on dna_test (seq)
          " | psql -d test

        CREATE INDEX

        real    63m20.451s

        63m to create the index - that makes sense.

        But of course, Pg will not use it:

        $ time echo "
            explain analyze select * from dna_test order by seq
          " | psql -1qtAd test > dna_test.order_by_seq.explain_analyze.txt

        real    57m44.054s

        $ cat dna_test.order_by_seq.explain_analyze.txt
        Sort  (cost=35550611.84..36003765.76 rows=181261568 width=62) (actual time=1834228.291..3428773.879 rows=181261572 loops=1)
          Sort Key: seq
          Sort Method:  external merge  Disk: 12933032kB
          ->  Seq Scan on dna_test  (cost=0.00..3872406.68 rows=181261568 width=62) (actual time=17.966..65802.621 rows=181261572 loops=1)
        Total runtime: 3463580.243 ms

        Update 2: removed a useless use of cat, lest I receive another uuc-award...

Re: sorting very large text files
by gam3 (Curate) on Jan 06, 2010 at 03:40 UTC
    If the keys will fit into memory (even though the whole file won't), you can do this:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use vars qw( @data $prev $IN @sorted $OUT);

    # First pass: record each line's key, start offset, and byte length.
    $prev = 0;
    open($IN, '<', 'data');
    while (<$IN>) {
        chomp;
        my $next = tell($IN);
        my ($key) = /^([^\s]*)/;
        push(@data, [ $key, $prev, $next - $prev ]);
        $prev = $next;
    }
    close($IN);

    # Sort the key/offset/length index by key.
    @sorted = sort( { $a->[0] cmp $b->[0] } @data);

    # Second pass: seek to each record in sorted order and copy it out.
    open($IN, '<', 'data');
    open($OUT, '>', 'sorteddata');
    for (@sorted) {
        seek($IN, $_->[1], 0);
        sysread($IN, my $line, $_->[2]);
        print $OUT $line;
    }
    close($IN);
    close($OUT);
    On the cooked data I tested, I got the following timings:
    Gnu Sort:
    # time sort --temporary-directory=/opt data > sort1
    
    real    0m24.698s
    user    0m22.539s
    sys     0m1.950s
    
    Perl Sort:
    # time perl sort.pl 
    
    real    0m55.900s
    user    0m39.897s
    sys     0m6.430s
    
    The data file I used had a wc of:
    #wc data
      4915200  34406400 383385600 data
    
    I am surprised that this Perl script is only half the speed of GNU sort on this data. I think that on a bigger data set, with long lines, it might even be able to sort faster than GNU sort.

    UPDATE: Most of the time seems to be spent in the output loop. All of the seeks seem to really slow things down.

    -- gam3
    A picture is worth a thousand words, but takes 200K.
      I am surprised that this Perl script is only half the speed of GNU sort ... Most of the time seems to be spent in the output loop.

      Your file consists of 5 million 78-byte lines. If you assume that the basic block size of your disk is 4k, then:

      • when reading sequentially, even if the file were totally fragmented and required a seek and/or disk rotation between each read, it would require about 93,000 seeks & reads.

      • but when reading randomly, each block will need to be re-sought & read 52 times--because each 4k block contains 52 lines.

        So that's close to 5 million seeks & reads.

      Of course, it won't be quite that bad, because your OS will attempt to keep the blocks it has read in its cache and thereby fulfil subsequent line-read requests from previously read blocks. But as the file gets bigger and the contents more disordered, the odds of it being able to retain blocks long enough for all 52 line requests to be satisfied from one read drop off rapidly.

      This is further compounded by the fact that you are writing your records to the output file as you read them, forcing the disk to interleave seeks on the input file with seeks on the output file--unless your /opt directory is on another drive.

      You could probably improve the performance of your code on single-disk systems by allocating a large buffer for the output, accumulating the records in it, and then writing it in one blast. If you try that, make sure that you overwrite the buffer rather than freeing and reallocating it.

      Better still would be to allocate two large buffers: accumulate in one, then switch to the other whilst the first is being written. Threads or asynchronous IO would be needed to make best use of that.

      Ultimately, the fastest external sorts avoid random IO by reading sequential chunks of the file, sorting them in memory, and writing the results to temporary files. These are then merge sorted into the output file. Although the merge sort phase isn't strictly sequential because it requires seeks between the different temporary files, all the records from any one temporary file--and any given 4k block--will be accessed sequentially. That means the system only needs to cache one 4k block per temporary file to avoid having to re-seek & read any given block. A far more likely scenario than being able to cache all the active blocks in a huge file being read randomly.
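
      A minimal, untested sketch of that chunk/sort/merge idea (the chunk size, file names, and the simple linear-scan merge are illustrative assumptions; a production version would tune the chunk size to RAM and use a heap for the merge):

      #!/usr/bin/perl
      # Hypothetical sketch of an external merge sort: sort fixed-size chunks in
      # memory, spill each chunk to a temporary "run" file, then merge the runs.
      use strict;
      use warnings;

      my $chunk_lines = 1_000_000;       # records per in-memory chunk; tune to RAM
      my @runs;

      # Phase 1: read sequential chunks, sort each in memory, write a sorted run.
      open my $in, '<', 'inputfile.txt' or die $!;
      while (!eof $in) {
          my @chunk;
          while (my $line = <$in>) {
              push @chunk, $line;
              last if @chunk >= $chunk_lines;
          }
          my $name = sprintf 'run_%04d.tmp', scalar @runs;
          open my $fh, '>', $name or die $!;
          print {$fh} sort @chunk;       # whole-line lexical sort
          close $fh;
          push @runs, $name;
      }
      close $in;

      # Phase 2: k-way merge -- only one buffered line per run is needed, so each
      # run (and each 4k block) is read sequentially, as described above.
      my @heads;
      for my $name (@runs) {
          open my $fh, '<', $name or die $!;
          push @heads, { fh => $fh, line => scalar <$fh> };
      }
      open my $out, '>', 'sorted.txt' or die $!;
      while (my @live = grep { defined $_->{line} } @heads) {
          # pick the run whose current line sorts first (linear scan for brevity)
          my ($min) = sort { $a->{line} cmp $b->{line} } @live;
          print {$out} $min->{line};
          $min->{line} = readline $min->{fh};
      }
      close $out;
      unlink @runs;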


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.