Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Dear fellow monks,

I have a design/ tactical problem for you.

Intro (skippable)

I'm analysing some large stretches of data at the moment. I have been writing some scripts to fetch data from a mere 2 gigabyte of text-files, writing needed info to a 220 meg binary file, and chopped that one in two. All sweet, but than I thought, well, it must be more efficient to make a sorted file with perl, and access the data via perl from matlab than to let matlab sort 'm. Just because the matrix is too large to fit in real memory, and I'm not allowed to buy more megs for my box.

Problem

So I want to sort some 105 megs of binary data on a 16 bits integer. Record length is 8 bytes ('SLS'). When I just read it in, perl dies with 'Out of memory'. Rather strange, because I only have 13 M items, and 256M physical memory and 128 M swap. I guess the sort expands the int's to 8 byte floats, but even that should fit. However, I just accept that.

Solution 1: Changing the perl commands

Here is the code I use to sort, I hope you can pinpoint me a problem:
#!/usr/bin/perl -w #constants $record_size = 8; $amp_index = 6; $amp_format = 'S'; $time_length = 6; $sorted = $big=$ARGV[0]; $sorted =~ s/(\.bin)$/_sorted$1/; #load the data from $big open DATA, $big; binmode DATA; my $amp_a = loadamp(); close DATA; die "Could not read data from $big" unless @{$amp_a->[0]}; print "Ready with loading\n"; #sort on first item: amp my @sortamp = sort{ $a->[0] <=> $b->[0] } @$amp_a; #print out in the 'bigfile' order ([0] contains amp, [1] string with t +ime-info open OUT, ">$sorted"; binmode OUT; for $amp( @sortamp ){ print OUT $amp->[1].pack( $amp_format, $amp->[0]); } close OUT; sub loadamp{ my $str; my @amp_a; my $n = 1; do { $n = sysread( DATA, $str, $record_size); sysseek( DATA, $record_size, 1); (my $amp)=unpack( $amp_format, substr( $str, $amp_index, 2) ); push(@amp_a, [$amp,substr( $str, 0, $time_length)]) if $n; } while ( $n); \@amp_a; }

Solution 2: Sort in file

This could be very inefficient, because you will have to read and write 100 meg chunks many times again.

Solution 2: Seek in file

Maybe I could use Search::binary to find ints of a certain value, and repeat that for all values of int in the file.

Solution 3: Use known distribution of ints

I know how by approximation how the ints are distributed. So I can start a brand new file, fill in the expected boundaries, and move them as seems fit.

Solution 4: External module/ (C) library

Maybe I'd better use some other library to sort my ints. They could be more memory efficient. But I'm really lost here. Couldn't find any reference to binary-sorting in CPAN or at perlmonks.org.

I hope you can appreciate this problem, and I'm curious to any idea's, suggestions, etc. Please point me some nice direction!

In high regard of you all,

Jeroen
"We are not alone"(FZ)


In reply to Sorting data that don't fit in memory by jeroenes

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others avoiding work at the Monastery: (3)
    As of 2020-10-24 07:04 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      My favourite web site is:












      Results (242 votes). Check out past polls.

      Notices?