Sorting data that don't fit in memory

jeroenes has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

I have a design/ tactical problem for you.

Intro (skippable)

I'm analysing some large stretches of data at the moment. I have been writing some scripts to fetch data from a mere 2 gigabyte of text-files, writing needed info to a 220 meg binary file, and chopped that one in two. All sweet, but than I thought, well, it must be more efficient to make a sorted file with perl, and access the data via perl from matlab than to let matlab sort 'm. Just because the matrix is too large to fit in real memory, and I'm not allowed to buy more megs for my box.

Problem

So I want to sort some 105 megs of binary data on a 16 bits integer. Record length is 8 bytes ('SLS'). When I just read it in, perl dies with 'Out of memory'. Rather strange, because I only have 13 M items, and 256M physical memory and 128 M swap. I guess the sort expands the int's to 8 byte floats, but even that should fit. However, I just accept that.

Solution 1: Changing the perl commands

Here is the code I use to sort, I hope you can pinpoint me a problem:

#!/usr/bin/perl -w
#constants
$record_size = 8;
$amp_index = 6;
$amp_format = 'S';
$time_length = 6;

$sorted = $big=$ARGV[0];
$sorted =~ s/(\.bin)$/_sorted$1/;

#load the data from $big
open DATA, $big;
binmode DATA;
my $amp_a = loadamp();
close DATA;

die "Could not read data from $big" unless @{$amp_a->[0]};

print "Ready with loading\n";

#sort on first item: amp
my @sortamp = sort{ $a->[0] <=> $b->[0] } @$amp_a; 

#print out in the 'bigfile' order ([0] contains amp, [1] string with t
+ime-info
open OUT, ">$sorted";
binmode OUT;
for $amp( @sortamp ){
  print OUT $amp->[1].pack( $amp_format, $amp->[0]); 
}
close OUT;

sub loadamp{
  my $str;
  my @amp_a;
  my $n = 1;
  do {
    $n = sysread( DATA, $str, $record_size);
    sysseek( DATA, $record_size, 1);
    (my $amp)=unpack( $amp_format, substr( $str, $amp_index, 2) );
    push(@amp_a, [$amp,substr( $str, 0, $time_length)]) if $n;
  } while ( $n);
  \@amp_a;
}
[download]

Solution 2: Sort in file

This could be very inefficient, because you will have to read and write 100 meg chunks many times again.

Solution 2: Seek in file

Maybe I could use Search::binary to find ints of a certain value, and repeat that for all values of int in the file.

Solution 3: Use known distribution of ints

I know how by approximation how the ints are distributed. So I can start a brand new file, fill in the expected boundaries, and move them as seems fit.

Solution 4: External module/ (C) library

Maybe I'd better use some other library to sort my ints. They could be more memory efficient. But I'm really lost here. Couldn't find any reference to binary-sorting in CPAN or at perlmonks.org.

I hope you can appreciate this problem, and I'm curious to any idea's, suggestions, etc. Please point me some nice direction!

In high regard of you all,

Jeroen
"We are not alone"(FZ)

Back to Seekers of Perl Wisdom