Hi!
I have very large text files (10GB to 15GB). Each one contains about 40 million lines and 7 fields (see example below).
A file looks like this:
99_999_852_F3 chr9 97768833 97768867 ATTTTCTTCAATTACATTTCCAATGCTATCCCAAA + 35
99_999_852_F3 chr9 97885645 97885679 ATTTTCTTCAaTTACATTTCCAATGCTATCCCAAA + 35
99_99_994_F3 chr10 47028821 47028855 AGACAAAAAGGCCATCAACAGATCAGTAAAGGATC + 35
...
I need to sort the files on field 1 (ASCII order). I am using the Unix sort -k1 command. It works, but it takes a very long time: 30 minutes to 1 hour. I also tried the following Perl script:
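#!/usr/bin/perl
use strict;
use warnings;
# Lexical filehandles with three-argument open
open(my $in, '<', "inputfile.txt") or die "Cannot open inputfile.txt: $!";
open(my $out, '>', "sorted.txt") or die "Cannot open sorted.txt: $!";
# sort <$in> reads every line into memory and compares whole lines
foreach (sort <$in>) {
    print {$out} $_;
}
close($out) or die "Cannot close sorted.txt: $!";
close($in);
exit;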
However, this script reads the entire file into memory, and the sorting becomes too slow. Could someone suggest a Perl script that sorts faster than the Unix sort -k1 command and does not use too much memory? Thanks.
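For comparison, here is a variant of the sort call that may be worth timing as a baseline. It is not a Perl script, only a sketch that assumes GNU sort: -s, -S and -T are not part of POSIX sort, and the 512M buffer and the "." temporary directory are placeholder values, not recommendations. The file names are the ones from the script above, and the three-line input is a stand-in for the real file.

```shell
#!/bin/sh
# Tiny stand-in for the real 10-15 GB file (records from the example above).
cat > inputfile.txt <<'EOF'
99_999_852_F3 chr9 97768833 97768867 ATTTTCTTCAATTACATTTCCAATGCTATCCCAAA + 35
99_99_994_F3 chr10 47028821 47028855 AGACAAAAAGGCCATCAACAGATCAGTAAAGGATC + 35
99_999_852_F3 chr9 97885645 97885679 ATTTTCTTCAaTTACATTTCCAATGCTATCCCAAA + 35
EOF

# LC_ALL=C   compare raw bytes (plain ASCII order) instead of locale collation
# -k1,1      key is field 1 only; plain -k1 means "field 1 to end of line"
# -s         stable: records with equal keys keep their input order
# -S 512M    in-memory buffer size (assumed value; tune to the available RAM)
# -T .       directory for temporary merge files (assumed; needs free space)
LC_ALL=C sort -s -k1,1 -S 512M -T . inputfile.txt > sorted.txt
```

The main differences from sort -k1 are the byte-wise comparison and the shorter key. How much time they save on a 40-million-line file has not been measured here.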