http://qs321.pair.com?node_id=53945


in reply to Re: Sorting data that don't fit in memory
in thread Sorting data that don't fit in memory

While I'm looking into the possibilities provided by the other posters, these tips don't require more advanced knowledge than I have ;-).

Thank you for warning me about the anonymous array overhead. Will change that immediately.

Furthermore, I'm on little endian, so the string comparison should work just fine. However, I don't want to sort on the remainder of the string as it is now. Will run a test first, and post the results here.

Thanks a lot,

Jeroen
"We are not alone"(FZ)

Update: Must have made some difference in memory usage, but it's still way too much. Made a rough calculation of the memory usage with system monitor (gnome) and this little script:

#!/usr/bin/perl
my $size=1E6;
my $f = 2;
print "Increasing memory print each step by a factor $f\n\n";
while(1){
    print "\tCreating array of $size items...Press enter to continue....";
    my $b = <STDIN>;
    my @a=1..$size;
    print "...done. Press enter to continue.";
    $b = <STDIN>;
    chomp $b;
    $b=~ /[qxQX]/ and exit;
    $size *= $f;
}
It's about 44 megabytes for every million items in an array. No way my 12M records are going to fit in physical memory. I'm back to radix sort and the like.
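
Because the sort key is only 16 bits wide, one possible reading of the radix-sort idea is a single counting pass: tally how often each key occurs, turn the tallies into starting slots, then copy each record straight to its final position. The following is a minimal sketch, not a tested production routine. It assumes fixed-size records with the key at a known offset (the 8-byte record and 2-byte key at offset 6 come from the reply further down), and the sub name `counting_sort_file` and its argument order are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: stable counting sort of fixed-size records.
# $in  : seekable handle opened for reading
# $out : seekable handle opened for writing
sub counting_sort_file {
    my ($in, $out, $recsize, $keyoff, $keysize) = @_;
    my ($rec, %count);

    # Pass 1: count how many records carry each key.
    seek($in, 0, 0);
    while (read($in, $rec, $recsize)) {
        last if length($rec) < $recsize;    # ignore a truncated tail
        $count{ substr($rec, $keyoff, $keysize) }++;
    }

    # Turn the counts into the first output slot for each key,
    # visiting keys in the order a plain string sort gives them.
    my %slot;
    my $next = 0;
    for my $key (sort keys %count) {
        $slot{$key} = $next;
        $next += $count{$key};
    }

    # Pass 2: copy every record to its slot. Records with the same
    # key are placed in input order, so equal keys are not reshuffled.
    seek($in, 0, 0);
    while (read($in, $rec, $recsize)) {
        last if length($rec) < $recsize;
        my $key = substr($rec, $keyoff, $keysize);
        seek($out, $slot{$key}++ * $recsize, 0);
        print {$out} $rec;
    }
    return $next;    # number of records written
}
```

Memory use is bounded by the number of distinct keys (at most 65,536 for a 2-byte key), whatever the file size. The price is one seek per record on the output side, which may well be slow on a real disk; that trade-off is not measured here.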

(tye)Re: Sorting data that don't fit in memory
by tye (Sage) on Jan 24, 2001 at 19:40 UTC

    See numbers OK; Re: sorting comma separated value file for an example of how to efficiently sort (in terms of memory use and CPU time). You use two arrays, not an array of tiny arrays. Though I don't think you'll end up using this. (:

    However, I don't want to sort on the remainder of the string as it is now.

    The remainder of the string only enters into it if the first part is the same. Without sorting on the remainder of the string, the order for records with the same 16-bit integer will be "random". So you lose nothing by sorting on that extra data (other than the time involved in comparing those few extra bytes, which seems a net win considering the memory that can be saved).
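
A small sketch of the point above: when whole records are compared as strings, the leading key decides the order and the trailing bytes only break ties. It also shows something worth checking before relying on `cmp` for packed integers: byte-wise comparison matches numeric order when the key is stored big-endian (`pack 'n'`), but not when it is stored little-endian (`pack 'v'`). The record layout (2-byte key followed by a payload) and the sub name `sorted_keys` are assumptions made only for this demo.

```perl
use strict;
use warnings;

# Sort packed records by plain string comparison and return the
# decoded 16-bit keys in sorted order. $fmt is 'n' (big-endian)
# or 'v' (little-endian); both are standard pack() templates.
sub sorted_keys {
    my ($fmt, @pairs) = @_;
    my @recs = map { pack($fmt, $_->[0]) . $_->[1] } @pairs;
    return map { unpack($fmt, $_) } sort { $a cmp $b } @recs;
}

my @pairs = ([256, 'x'], [1, 'z'], [256, 'a'], [1, 'b']);
print join(' ', sorted_keys('n', @pairs)), "\n";   # prints "1 1 256 256"
print join(' ', sorted_keys('v', @pairs)), "\n";   # prints "256 256 1 1"
```

If the keys turn out to be little-endian on disk, one option is to swap the two key bytes while copying them into the sort string.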

    But you can save a ton of memory by not storing any of your records in memory as follows:

    my $maxrecs= 8*1024*1024; # Whatever you determine fits
    my $recsize= 8;
    my $sortsize= 2;
    my $sortoff= 6;
    # Here is the only memory hog:
    my $sorton= " "x($maxrecs*$sortsize);
    my $idx= 0;
    my $rec;
    # Note that I don't use sysread() here as I think the
    # buffering offered by read() may improve speed:
    while(  $idx < $maxrecs  &&  read(FILE,$rec,$recsize)  ) {
        # Copy just the sort key of each record into one big string
        substr( $sorton, $idx++*$sortsize, $sortsize )=
            substr( $rec, $sortoff, $sortsize );
    }
    # Sort record numbers by comparing their keys inside $sorton
    my @idx= sort {
        substr($sorton,$a*$sortsize,$sortsize)
        cmp substr($sorton,$b*$sortsize,$sortsize)
    } 0..($idx-1);
    for $idx ( @idx ) {
        # sysseek() is the documented partner of sysread();
        # seek() is only reliable with buffered read()
        sysseek( FILE, $idx*$recsize, 0 );
        sysread( FILE, $rec, $recsize );
        print OUT $rec; # or substr($rec,0,6)
    }

    Personally, I'd just figure out how many records you can sort using this modification, sort that many, and write the sorted list out. Repeat with a new output file until you have, oh, 64 output files or no more data. Then merge the 64 output files into one. Repeat until you have 64 merged files or no more data, then merge the merged files.

    For merging I'd use a heap (an efficient way of inserting new items into a partially sorted list such that you can efficiently always pull out the "first" item from the list).
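
The heap-based merge could look roughly like the sketch below. It is an illustration only: the sub names (`heap_less`, `heap_push`, `heap_pop`, `merge_runs`) are made up here, and each sorted run is modelled as a code ref that returns the next record or `undef` when exhausted, so the same code could sit on top of file reads. Ties between runs are broken by run number, which keeps equal keys in their original order as long as the runs were written in input order.

```perl
use strict;
use warnings;

# Heap items are [key, run_number, record]. Order by key, then by run.
sub heap_less {
    my ($x, $y) = @_;
    return (($x->[0] cmp $y->[0]) || ($x->[1] <=> $y->[1])) < 0;
}

# Add an item and sift it up until its parent is not larger.
sub heap_push {
    my ($heap, $item) = @_;
    push @$heap, $item;
    my $i = $#$heap;
    while ($i > 0) {
        my $parent = int(($i - 1) / 2);
        last unless heap_less($heap->[$i], $heap->[$parent]);
        @$heap[$parent, $i] = @$heap[$i, $parent];
        $i = $parent;
    }
}

# Remove and return the smallest item, or nothing if the heap is empty.
sub heap_pop {
    my ($heap) = @_;
    return unless @$heap;
    my $top  = $heap->[0];
    my $last = pop @$heap;
    if (@$heap) {
        $heap->[0] = $last;
        my $i = 0;
        while (1) {                      # sift the moved item down
            my $small = $i;
            for my $child (2 * $i + 1, 2 * $i + 2) {
                next if $child > $#$heap;
                $small = $child if heap_less($heap->[$child], $heap->[$small]);
            }
            last if $small == $i;
            @$heap[$small, $i] = @$heap[$i, $small];
            $i = $small;
        }
    }
    return $top;
}

# Merge sorted runs: $emit receives each record in order,
# $key_of extracts the sort key from a record.
sub merge_runs {
    my ($emit, $key_of, @runs) = @_;
    my @heap;
    for my $r (0 .. $#runs) {
        my $rec = $runs[$r]->();
        heap_push(\@heap, [$key_of->($rec), $r, $rec]) if defined $rec;
    }
    while (my $top = heap_pop(\@heap)) {
        my (undef, $r, $rec) = @$top;
        $emit->($rec);
        my $next = $runs[$r]->();        # refill from the same run
        heap_push(\@heap, [$key_of->($next), $r, $next]) if defined $next;
    }
}
```

With 64 runs the heap never holds more than 64 items, so the memory cost of the merge step is negligible next to the sort step.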

    Let me know if you need more details but I suspect the several references already mentioned should cover this.

            - tye (but my friends call me "Tye")
      Read the note at the bottom

      I'm impressed again. I was just installing BerkeleyDB according to tilly's suggestion. That was a real nice one, as well. I like using existing modules, as opposed to some others@pm. However, your code seems to be exactly what I wanted.

      Well, the size of the string $sorton shouldn't be much of a problem: it only uses (in my case) 2*13M = 26M of memory. A small sacrifice.

      I did a check whether my local 'sort' (qsort, apparently) messed with the order of equal keys:

      #!/usr/bin/perl
      @data=qw(bb bbZZ aaZZ aaSD aaPM aaAA aa);
      print join " ", sort { substr($a,0,2) cmp substr($b,0,2) } @data;
      #Result: aaZZ aaSD aaPM aaAA aa bb bbZZ
      So, it doesn't. I'm very happy with that, because the remainder of the string is a code for the time, stored with something like 100us accuracy, with a maximum of 2-3 days. You don't want to sort on that, but you also don't want to change the order. I hope this explains my reluctance to mix it up.
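
A single test run cannot prove that the built-in sort always preserves the order of equal keys. One way to avoid depending on that is to sort positions rather than values and fall back to the position itself when two keys are equal. The sketch below uses the same test data; the sub name `stable_order` is hypothetical.

```perl
use strict;
use warnings;

# Sort positions by a 2-character key prefix; when two keys are equal,
# the original position decides, so ties keep their input order.
sub stable_order {
    my @data = @_;
    return map { $data[$_] }
           sort { substr($data[$a], 0, 2) cmp substr($data[$b], 0, 2)
                  or $a <=> $b }
           0 .. $#data;
}

print join(' ', stable_order(qw(bb bbZZ aaZZ aaSD aaPM aaAA aa))), "\n";
# prints "aaZZ aaSD aaPM aaAA aa bb bbZZ"
```

The same `or $a <=> $b` tie-breaker can be appended to a comparison that sorts record numbers, since those numbers already are the input positions.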

      I guess the key notion here is the collection of my keys in a string, saving the array-overhead.

      Thanks a lot,

      Jeroen
      "We are not alone"(FZ)

      Update: I'm afraid I was happy too soon. The real memory hog is the 0..($idx-1) list. That gives you a 13M-item array, which won't fit in memory. Too bad.