in reply to fast lookups in files
Two things I would try:
1) Compression. If you have any experience with bit coding, try to shrink things down. The best way I know (the only way I've tried) is with inline C. Compression very often makes a data set much more manageable, though it requires some low-level tinkering. Many simple schemes, like variable-byte and Rice coding, are easy to implement (if I can do it, anyone can!).
2) I don't think binary search is a good solution here. True, it gives O(log n) lookups, which sounds good, but if each probe is a disk read it will be very slow, especially if the lookups are spread widely across the file and can't benefit from caching. Instead, I would do the following: before your operation, go through the file once, calling tell every so often, say, whenever the leading digit of the first column changes -- any scheme that leaves you with a manageable number of chunks. Build a hash that maps each chunk to its starting byte offset into the file and its length in bytes; you can use Storable to save this hash. To look up a key-value pair, query the hash, then slurp the associated chunk of the file (only one disk operation). Since the chunk is still sorted you can binary-search it, but since it's now in memory a linear scan may be fine too.
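A minimal pure-Perl sketch of the variable-byte coding mentioned in point 1 (the post suggests inline C for speed; this is just the algorithm, and the sub names are my own):

```perl
use strict;
use warnings;

# Variable-byte coding: each byte carries 7 bits of the number;
# the high bit marks the final byte of a value.
sub vbyte_encode {
    my ($n) = @_;
    my @bytes;
    while ($n >= 128) {
        push @bytes, $n & 0x7F;   # low 7 bits, continuation byte
        $n >>= 7;
    }
    push @bytes, $n | 0x80;       # final byte: high bit set
    return pack 'C*', @bytes;
}

sub vbyte_decode {
    my ($data) = @_;
    my ($n, $shift) = (0, 0);
    for my $byte (unpack 'C*', $data) {
        return $n | (($byte & 0x7F) << $shift) if $byte & 0x80;
        $n     |= $byte << $shift;
        $shift += 7;
    }
    return;    # ran out of bytes without a terminator
}
```

Small numbers take one byte instead of four, which is where the space savings on typical integer columns come from.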
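And a sketch of the chunk-index idea from point 2, assuming a sorted tab-separated key/value file and chunking on the leading character of the key (the file name, granularity, and sub names are assumptions, not part of the original):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# One pass with tell: record [byte offset, byte length] for each chunk
# of lines sharing a leading character of the first column.
sub build_index {
    my ($file) = @_;
    open my $fh, '<', $file or die "open $file: $!";
    my (%index, $key, $start);
    my $pos = 0;
    while (my $line = <$fh>) {
        my $leading = substr $line, 0, 1;
        if (!defined $key or $leading ne $key) {
            $index{$key} = [ $start, $pos - $start ] if defined $key;
            ($key, $start) = ($leading, $pos);
        }
        $pos = tell $fh;
    }
    $index{$key} = [ $start, $pos - $start ] if defined $key;
    close $fh;
    return \%index;
}

# One disk operation per lookup: slurp the chunk, then scan it.
# The chunk is sorted, so a binary search would also work.
sub lookup {
    my ($file, $index, $wanted) = @_;
    my $entry = $index->{ substr $wanted, 0, 1 } or return;
    my ($offset, $length) = @$entry;
    open my $fh, '<', $file or die "open $file: $!";
    seek $fh, $offset, 0 or die "seek: $!";
    read $fh, my $chunk, $length;
    close $fh;
    for my $line (split /\n/, $chunk) {
        my ($k, $v) = split /\t/, $line, 2;
        return $v if $k eq $wanted;
    }
    return;
}

# The index survives between runs via Storable:
#   store($index, 'index.stor');
#   my $index = retrieve('index.stor');
```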
Re^2: fast lookups in files
by Corion (Patriarch) on Feb 05, 2008 at 14:32 UTC
For the simple things, like packing a number into 4 bytes, there is pack, which does that and many more things. The usual approach would be to pack all the line offsets (or a better key) into one long string, four bytes per line:
my $offsets = pack 'N', 0;           # line 0 starts at offset 0
while (<$fh>) {
    # tell after the read gives the start of the *next* line
    $offsets .= pack 'N', tell $fh;
}
Whether to store an offset for every line or just for every 10th line is a matter of (low-level) optimization.
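To round this out, a packed offset string like the one above supports direct random access to any line with a single seek; the helper below and its name are my own illustration, assuming the string was seeded with a packed 0 for the first line:

```perl
use strict;
use warnings;

# Fetch line $i (0-based): unpack its 4-byte big-endian offset from the
# packed string, seek there, and read one line.
sub line_at {
    my ($fh, $offsets, $i) = @_;
    my $offset = unpack 'N', substr($offsets, $i * 4, 4);
    seek $fh, $offset, 0 or die "seek: $!";
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}
```

At four bytes per line the whole index for a million-line file is only 4 MB, so it fits comfortably in memory even when the file itself does not.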