
Re^3: Help performing "random" access on very very large file

by BrowserUk (Patriarch)
on Jul 16, 2007 at 20:32 UTC ( #626917=note )

in reply to Re^2: Help performing "random" access on very very large file
in thread Help performing "random" access on very very large file

One very effective optimisation for reducing the size of the index is to store just lengths, rather than full absolute offsets, for most of the lines.

Say, for example, that you know all the lines in your 500GB file are less than 255 characters long and, for calculation purposes, average 100 characters. Then if you build a full, absolute index using the pack 'd' template (required for anything over 4GB), your index itself is going to be 500GB / 100 * 8 ~= 40 GB!
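A double holds integers exactly up to 2**53, so the 8-byte 'd' template round-trips offsets far beyond the 4GB ceiling of the 32-bit 'N'/'V'/'L' templates (and, unlike 'q', it works on perls built without 64-bit integer support). A minimal check:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An offset well past the 4GB limit of 32-bit pack templates.
my $offset = 500 * 2**30 + 12_345;

my $packed = pack 'd', $offset;
printf "packed size: %d bytes\n", length $packed;                   # 8
printf "round-trips: %s\n", unpack( 'd', $packed ) == $offset ? 'yes' : 'no';
```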

However, if you store the absolute index of every (say) 1000th line, and 999 x 1-byte lengths for the intervening lines, the index size becomes: 500 GB / (100*1000) * 1007 ~= 5GB.

Each record within your index is then a fixed 1007 bytes, and the process of locating any individual line becomes:

  my $recNo  = int( ( $lineNo - 1 ) / 1000 );   # $lineNo is 1-based
  my $offset = ( $lineNo - 1 ) % 1000;          # 0 .. 999 within the record
  seek INDEX, $recNo * 1007, 0;
  read INDEX, $idxBuf, 1007;
  my $absOffset = unpack 'd', $idxBuf;
  my( $addOffset, $length ) = unpack "x[d] %32C$offset C", $idxBuf;
  seek BIGFILE, $absOffset + $addOffset, 0;
  read BIGFILE, $line, $length;

By using the 'checksum' unpack template (%<bits><type>) to sum the intermediate lengths, the calculation of the composite offset is very fast. This puts any line of your huge file just two reads away.
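The steps above can be exercised end-to-end on toy data. The sketch below keeps everything in memory (substr stands in for the seek/read pairs), and deviates from the description in two labelled ways purely for illustration: the spacing is 4 rather than 1000, and a length byte is stored for the anchored line as well, so each record is a fixed 8 + $N bytes and every line's length comes out of its own record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy data: 10 short lines of varying length.
my $N = 4;                                       # spacing; the post uses 1000
my @lines   = map { sprintf "line %02d %s\n", $_, 'x' x ( $_ % 7 ) } 0 .. 9;
my $bigfile = join '', @lines;

# Build the index: a pack 'd' absolute offset per anchor line,
# then a pack 'C' length (newline included) for every line.
my( $index, $pos ) = ( '', 0 );
for my $i ( 0 .. $#lines ) {
    $index .= pack 'd', $pos if $i % $N == 0;    # anchor: absolute offset
    $index .= pack 'C', length $lines[$i];       # 1-byte line length
    $pos   += length $lines[$i];
}

# Locate line $lineNo (0-based here) with two "reads".
sub fetchLine {
    my $lineNo = shift;
    my( $recNo, $offset ) = ( int( $lineNo / $N ), $lineNo % $N );
    my $idxBuf    = substr $index, $recNo * ( 8 + $N ), 8 + $N;
    my $absOffset = unpack 'd', $idxBuf;
    # Checksum the first $offset lengths, then grab the line's own length.
    my( $addOffset, $length ) = unpack "x[d] %32C$offset C", $idxBuf;
    return substr $bigfile, $absOffset + $addOffset, $length;
}

print fetchLine( 6 );   # prints "line 06 xxxxxx"
```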

Note: There is nothing special about the number 1000 I've used in this example, except that it is convenient for calculations. As the composite offset is a 32-bit sum (%32C) of lengths that are 8-bit unsigned chars, the theoretical maximum absolute-offset spacing is 2**32 / 2**8 == 2**24 == 16M lines.

A quick test suggests that it takes Perl ~60 milliseconds to checksum 16MB of chars. That's far less time than having to read even one extra data record.
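That figure is easy to re-check on your own hardware; a minimal timing sketch, using the core Time::HiRes module:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( time );

# Time the 32-bit checksum over 16MB of 1-byte lengths.
# 255 * 16MB = 4278190080 still fits in 32 bits, so no wraparound.
my $buf = "\xff" x ( 16 * 1024 * 1024 );

my $start = time;
my $sum   = unpack '%32C*', $buf;
printf "sum %u in %.1f ms\n", $sum, ( time() - $start ) * 1000;
```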

However, there is little extra saving to be had by further decreasing the frequency of the absolute offsets.

  • Using an absolute offset every 1000 lines gives an index size of: 5.035 GB

    The checksum time for 999 bytes: 112 microseconds.

  • Using an absolute offset every 64k lines gives an index size of: 5.00053405761719 GB

    The checksum time for 64KB: 320 microseconds.

  • Using an absolute offset every 16M lines gives an index size of: 5.00000208616257 GB

    The checksum time for 16MB: 60 milliseconds.

So, you see, there is little extra index-space saving from decreasing the frequency of absolute offsets, but there is a considerable calculation penalty in doing so. You also have to factor in the extra time taken to read the larger index records from the index file.
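The index sizes quoted above fall straight out of the arithmetic: one 8-byte absolute offset plus (spacing - 1) 1-byte lengths per record, times lines/spacing records. A quick reproduction (assuming 5e9 lines, i.e. 500GB at ~100 bytes/line, and GB == 1e9):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Index size as a function of absolute-offset spacing, for 5e9 lines.
my $lines = 5e9;
for my $spacing ( 1_000, 2**16, 2**24 ) {
    my $bytes = $lines / $spacing * ( 8 + $spacing - 1 );
    printf "spacing %10d -> %.11f GB\n", $spacing, $bytes / 1e9;
}
```

The three printed sizes match the 5.035 GB, 5.00053405761719 GB and 5.00000208616257 GB figures in the list above.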

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
