Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Dear fellow monks,

I have a design/ tactical problem for you.

Intro (skippable)

I'm analysing some large stretches of data at the moment. I have been writing some scripts to fetch data from a mere 2 gigabyte of text-files, writing needed info to a 220 meg binary file, and chopped that one in two. All sweet, but than I thought, well, it must be more efficient to make a sorted file with perl, and access the data via perl from matlab than to let matlab sort 'm. Just because the matrix is too large to fit in real memory, and I'm not allowed to buy more megs for my box.

Problem

So I want to sort some 105 megs of binary data on a 16 bits integer. Record length is 8 bytes ('SLS'). When I just read it in, perl dies with 'Out of memory'. Rather strange, because I only have 13 M items, and 256M physical memory and 128 M swap. I guess the sort expands the int's to 8 byte floats, but even that should fit. However, I just accept that.

Solution 1: Changing the perl commands

Here is the code I use to sort, I hope you can pinpoint me a problem:
#!/usr/bin/perl -w #constants $record_size = 8; $amp_index = 6; $amp_format = 'S'; $time_length = 6; $sorted = $big=$ARGV[0]; $sorted =~ s/(\.bin)$/_sorted$1/; #load the data from $big open DATA, $big; binmode DATA; my $amp_a = loadamp(); close DATA; die "Could not read data from $big" unless @{$amp_a->[0]}; print "Ready with loading\n"; #sort on first item: amp my @sortamp = sort{ $a->[0] <=> $b->[0] } @$amp_a; #print out in the 'bigfile' order ([0] contains amp, [1] string with t +ime-info open OUT, ">$sorted"; binmode OUT; for $amp( @sortamp ){ print OUT $amp->[1].pack( $amp_format, $amp->[0]); } close OUT; sub loadamp{ my $str; my @amp_a; my $n = 1; do { $n = sysread( DATA, $str, $record_size); sysseek( DATA, $record_size, 1); (my $amp)=unpack( $amp_format, substr( $str, $amp_index, 2) ); push(@amp_a, [$amp,substr( $str, 0, $time_length)]) if $n; } while ( $n); \@amp_a; }

Solution 2: Sort in file

This could be very inefficient, because you will have to read and write 100 meg chunks many times again.

Solution 2: Seek in file

Maybe I could use Search::binary to find ints of a certain value, and repeat that for all values of int in the file.

Solution 3: Use known distribution of ints

I know how by approximation how the ints are distributed. So I can start a brand new file, fill in the expected boundaries, and move them as seems fit.

Solution 4: External module/ (C) library

Maybe I'd better use some other library to sort my ints. They could be more memory efficient. But I'm really lost here. Couldn't find any reference to binary-sorting in CPAN or at perlmonks.org.

I hope you can appreciate this problem, and I'm curious to any idea's, suggestions, etc. Please point me some nice direction!

In high regard of you all,

Jeroen
"We are not alone"(FZ)


In reply to Sorting data that don't fit in memory by jeroenes

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (7)
As of 2024-04-19 10:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found