(tye)Re: Sorting data that don't fit in memory

by tye (Sage)
on Jan 24, 2001 at 19:40 UTC (#53999)

in reply to Re: Re: Sorting data that don't fit in memory
in thread Sorting data that don't fit in memory

See numbers OK; Re: sorting comma separated value file for an example of how to efficiently sort (in terms of memory use and CPU time). You use two arrays, not an array of tiny arrays. Though I don't think you'll end up using this. (:

However, I don't want to sort on the remainder of the string as it is now.

The remainder of the string only enters into it if the first part is the same. Without sorting on the remainder of the string, the order for records with the same 16-bit integer will be "random". So you lose nothing by sorting on that extra data (other than the time involved in comparing those few extra bytes, which seems a net win considering the memory that can be saved).
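To make the tie-breaking concrete, here is a small sketch on made-up data (the first two characters play the role of the sort key; the sub names are only illustrative). Sorting on the whole string orders equal-key records by their remaining bytes. If the input order of equal-key records has to be kept instead, one option is to sort indices and break ties on the index, which gives a deterministic result whatever the sort implementation does.

```perl
use strict;
use warnings;

# Illustrative data: the first two characters are the sort key.
my @data= qw( bb bbZZ aaZZ aaSD aaPM aaAA aa );

# Sort on the whole string: ties on the key are broken by the remainder.
sub by_whole_string {
    return sort { $a cmp $b } @_;
}

# Sort indices on the key only, breaking ties on the original position,
# so that records with equal keys keep their input order.
sub by_key_keep_order {
    my @recs= @_;
    my @idx= sort {
        substr($recs[$a],0,2) cmp substr($recs[$b],0,2)
            or  $a <=> $b
    } 0..$#recs;
    return @recs[@idx];
}

print join( " ", by_whole_string(@data) ), "\n";
# prints: aa aaAA aaPM aaSD aaZZ bb bbZZ
print join( " ", by_key_keep_order(@data) ), "\n";
# prints: aaZZ aaSD aaPM aaAA aa bb bbZZ
```

The index tie-break costs one extra numeric comparison per tie and no extra memory beyond the index list itself.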

But you can save a ton of memory by not storing any of your records in memory as follows:

    my $maxrecs=  8*1024*1024;  # Whatever you determine fits
    my $recsize=  8;
    my $sortsize= 2;
    my $sortoff=  6;
    # Here is the only memory hog:
    my $sorton= " "x($maxrecs*$sortsize);
    my $idx= 0;
    my $rec;
    # Note that I don't use sysread() here as I think the
    # buffering offered by read() may improve speed:
    while(  $idx < $maxrecs  &&  read(FILE,$rec,$recsize)  ) {
        substr( $sorton, $idx++*$sortsize, $sortsize )=
          substr( $rec, $sortoff, $sortsize );
    }
    # Sort record numbers by comparing their keys inside $sorton:
    my @idx= sort {
        substr($sorton,$a*$sortsize,$sortsize)
          cmp substr($sorton,$b*$sortsize,$sortsize)
    } 0..($idx-1);
    for $idx (  @idx  ) {
        # sysseek() rather than seek(), since it is paired with sysread():
        sysseek( FILE, $idx*$recsize, 0 );
        sysread( FILE, $rec, $recsize );
        print OUT $rec;   # or substr($rec,0,6)
    }

Personally, I'd just figure out how many records you can sort using this modification, sort that many, and write the sorted list out. Repeat with a new output file until you have, oh, 64 output files or no more data. Then merge the 64 output files into one. Repeat until you have 64 merged files or no more data. Then merge the merged files.
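The run-splitting step could look roughly like the sketch below. It is only an illustration under assumptions: the sub name `write_sorted_runs`, its parameters, and the "prefix.N" file naming are all made up here, and for brevity it holds each chunk's records in an array and sorts them directly rather than using the packed key string described above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a file of fixed-size records into sorted runs.
#   $in       input filehandle (binary data)
#   $prefix   run files are named "$prefix.0", "$prefix.1", ...
#   $maxrecs  how many records to sort at once
# Returns the list of run file names that were written.
sub write_sorted_runs {
    my( $in, $prefix, $maxrecs, $recsize, $sortoff, $sortsize )= @_;
    my @runs;
    while( 1 ) {
        # Read up to $maxrecs whole records:
        my @recs;
        while(  @recs < $maxrecs  ) {
            my $got= read( $in, my $rec, $recsize );
            last  unless  $got  &&  $got == $recsize;
            push @recs, $rec;
        }
        last  unless  @recs;
        # Sort this chunk on the key bytes only:
        my @sorted= sort {
            substr($a,$sortoff,$sortsize) cmp substr($b,$sortoff,$sortsize)
        } @recs;
        my $name= "$prefix." . @runs;
        open my $out, '>', $name  or  die "Can't write $name: $!";
        binmode $out;
        print {$out} @sorted;
        close $out  or  die "Can't close $name: $!";
        push @runs, $name;
    }
    return @runs;
}
```

Called as, say, `write_sorted_runs( \*FILE, "run", $maxrecs, 8, 6, 2 )`, this would leave one sorted file per chunk, ready to be merged.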

For merging I'd use a heap (an efficient way of inserting new items into a partially sorted list such that you can efficiently always pull out the "first" item from the list).
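A heap-based merge might look like the following sketch. The sub names (`heap_push`, `heap_pop`, `merge_runs`) and the `[ key, record, source ]` entry layout are assumptions made for illustration; the heap is a plain hand-rolled binary min-heap, not any particular module's API. The heap never holds more than one record per input file, so memory use stays tiny no matter how large the runs are.

```perl
use strict;
use warnings;

# Minimal binary min-heap of [ key, record, source index ] entries,
# ordered by string comparison of the key.
sub heap_push {
    my( $heap, $item )= @_;
    push @$heap, $item;
    my $i= $#$heap;
    while(  $i > 0  ) {         # sift the new entry up
        my $parent= int( ($i-1)/2 );
        last  if  ( $heap->[$parent][0] cmp $heap->[$i][0] ) <= 0;
        @$heap[$parent,$i]= @$heap[$i,$parent];
        $i= $parent;
    }
}

sub heap_pop {
    my( $heap )= @_;
    return  unless  @$heap;
    my $top= $heap->[0];
    my $last= pop @$heap;
    if(  @$heap  ) {            # move last entry to the root, sift it down
        $heap->[0]= $last;
        my $i= 0;
        while( 1 ) {
            my $min= $i;
            for my $kid ( 2*$i+1, 2*$i+2 ) {
                $min= $kid
                    if  $kid < @$heap
                    &&  ( $heap->[$kid][0] cmp $heap->[$min][0] ) < 0;
            }
            last  if  $min == $i;
            @$heap[$min,$i]= @$heap[$i,$min];
            $i= $min;
        }
    }
    return $top;
}

# Merge already-sorted runs of fixed-size records from @$ins into $out.
sub merge_runs {
    my( $ins, $out, $recsize, $sortoff, $sortsize )= @_;
    my @heap;
    # Read the next record from run $n, if any, and add it to the heap:
    my $refill= sub {
        my( $n )= @_;
        my $got= read( $ins->[$n], my $rec, $recsize );
        heap_push( \@heap, [ substr($rec,$sortoff,$sortsize), $rec, $n ] )
            if  $got  &&  $got == $recsize;
    };
    $refill->($_)  for  0..$#$ins;
    while(  my $item= heap_pop(\@heap)  ) {
        print {$out} $item->[1];
        $refill->( $item->[2] );    # replace it from the same run
    }
}
```

Each output record costs O(log N) comparisons for N input files, which is why merging 64 files at a time stays cheap.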

Let me know if you need more details but I suspect the several references already mentioned should cover this.

        - tye (but my friends call me "Tye")

Re: (tye)Re: Sorting data that don't fit in memory
by jeroenes (Priest) on Jan 24, 2001 at 20:07 UTC
    Read the note at the bottom

    I'm impressed again. I was just installing BerkeleyDB according to tilly's suggestion. That was a real nice one, as well. I like using existing modules, as opposed to some others@pm. However, your code seems to be exactly what I wanted.

    Well, the size of the string $sorton shouldn't be much of a problem; it only uses (in my case) 2*13M = 26M of memory. A small sacrifice.

    I did a check whether my local 'sort' (qsort, apparently) messed with the order of equal keys:

        #!/usr/bin/perl
        @data=qw(bb bbZZ aaZZ aaSD aaPM aaAA aa);
        print join " ", sort { substr($a,0,2) cmp substr($b,0,2) } @data;
        # Result: aaZZ aaSD aaPM aaAA aa bb bbZZ

    So, it doesn't. I'm very happy with that, because the remainder of the string is a code for the time, stored with something like 100us accuracy, with a maximum of 2-3 days. You don't want to sort on that, but you also don't want to change the order. I hope this explains my reluctance to mix it up.

    I guess the key notion here is the collection of my keys in a string, saving the array-overhead.

    Thanks a lot,

    "We are not alone"(FZ)

    Update: I'm afraid I was too quick to be happy. The real memory hog is the 0..($idx-1) list. That gives you a 13M-item array, which won't fit in memory. Too bad.
