http://qs321.pair.com?node_id=813469

rnaeye has asked for the wisdom of the Perl Monks concerning the following question:

Hi! I have very large text files (10GB to 15GB). Each one contains about 40 million lines and 7 fields (see example below).

File looks like this:

99_999_852_F3 chr9 97768833 97768867 ATTTTCTTCAATTACATTTCCAATGCTATCCCAAA + 35
99_999_852_F3 chr9 97885645 97885679 ATTTTCTTCAaTTACATTTCCAATGCTATCCCAAA + 35
99_99_994_F3 chr10 47028821 47028855 AGACAAAAAGGCCATCAACAGATCAGTAAAGGATC + 35
...

I need to sort the files on field 1 (ASCII order). I am using the Unix sort -k1 command. It works, but it takes a very long time: 30 minutes to 1 hour per file. I also tried the following Perl script:

#!/usr/bin/perl
use strict;
use warnings;

open (INFILE, "inputfile.txt") or die $!;
open (OUTFILE, '>', "sorted.txt") or die $!;
foreach (sort <INFILE>){
    print OUTFILE $_;
}
close(OUTFILE);
close(INFILE);
exit;

However, this script reads the entire file into memory, and the sort becomes too slow. Could someone suggest a Perl script that sorts faster than the Unix sort -k1 command without using too much memory? Thanks.
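For comparison with the sort -k1 timing above, here is a minimal sketch of how the sort invocation itself might be tuned. It assumes GNU coreutils sort (the -S buffer option is not in POSIX), and the file names sample.txt and sample.sorted.txt are placeholders, not the real data. Whether any of this beats the 30 to 60 minutes reported has not been measured here.

```shell
#!/bin/sh
# Sketch only: assumes GNU coreutils sort; file names are hypothetical.
# A two-line stand-in for the real input so the command can be tried safely.
printf '%s\n' \
  '99_99_994_F3 chr10 47028821 47028855 AGAC + 35' \
  '99_999_852_F3 chr9 97768833 97768867 ATTT + 35' \
  > sample.txt

# LC_ALL=C : compare raw bytes (plain ASCII order) instead of locale collation
# -k1,1    : the key is field 1 only; -k1 alone means field 1 through end of line
# -S 512M  : in-memory buffer size before spilling to temporary files (GNU option)
# -T .     : directory used for those temporary files
# -o FILE  : write the result to FILE
LC_ALL=C sort -k1,1 -S 512M -T . -o sample.sorted.txt sample.txt

cat sample.sorted.txt
```

In byte order the digit 9 sorts before the underscore, so 99_999_852_F3 comes out ahead of 99_99_994_F3. For a 10GB to 15GB file, a larger -S value and a -T directory on a fast disk with enough free space are the settings most worth experimenting with.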