Re: Help performing "random" access on very very large file

An interesting question. Some slightly different ideas from the excellent ones already given.

Depending on the total number of accesses relative to the total number of lines (Dave's question, which you didn't really answer), I would build either a full index (ikegami's idea) or a shallow index.

With 500 GB and an average line of, say, 1000 bytes, you still get a huge number of lines: 500M (5*10^8). A full index at 4 bytes per entry would be 500M * 4 bytes = 2 GB. A shallow index with, say, one entry every 50 lines would occupy only 2*10^9 / 50 = 4*10^7 bytes, i.e. 40 MB, a reasonable size. (Strictly, a 4-byte offset only reaches 4 GB, so a 500 GB file needs wider offsets; with 8-byte entries the figures double to 4 GB and 80 MB.) To go to line n you would seek to position int(n / 50) * 4 in the index, read the offset, seek there in the data file, and then skip n % 50 EOL markers (which is essentially the Tie::File logic). A shallow index is interesting when the number of accesses is much smaller than the total number of lines in the main file.
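To make the shallow index idea concrete, here is a minimal sketch, not a tuned implementation. The sub names, the step of 50, the 0-based line numbers, the "\n" line endings and the 8-byte big-endian offsets (pack 'Q>', which needs a 64-bit perl) are all my assumptions for illustration.

```perl
use strict;
use warnings;

# Assumed layout: one 8-byte big-endian offset for every $STEP-th line.
my $STEP  = 50;
my $WIDTH = 8;

# Scan the data file once and record the byte offset of lines 0, 50, 100, ...
sub build_shallow_index {
    my ($data_path, $index_path) = @_;
    open my $in,  '<:raw', $data_path  or die "open $data_path: $!";
    open my $out, '>:raw', $index_path or die "open $index_path: $!";
    my ($line_no, $offset) = (0, 0);
    while (defined(my $line = <$in>)) {
        print {$out} pack('Q>', $offset) if $line_no % $STEP == 0;
        $offset += length $line;    # bytes, since the handle is :raw
        $line_no++;
    }
    close $out or die "close $index_path: $!";
    close $in;
    return $line_no;                # total number of lines seen
}

# Return line $n (0-based), or nothing if $n is past the end of the file.
sub fetch_line {
    my ($data_fh, $index_fh, $n) = @_;
    seek($index_fh, int($n / $STEP) * $WIDTH, 0) or die "seek index: $!";
    my $packed;
    my $got = read($index_fh, $packed, $WIDTH);
    return unless defined $got && $got == $WIDTH;   # no such index entry
    seek($data_fh, unpack('Q>', $packed), 0) or die "seek data: $!";
    my $line;
    for (0 .. $n % $STEP) {         # skip n % 50 lines, then read the target
        $line = <$data_fh>;
        return unless defined $line;
    }
    return $line;
}
```

Each lookup costs two seeks plus at most 50 line reads, so the step is a direct trade between index size and read work per access.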

One other idea is to have a couple of processes (or more). One would be, say, a daemon listening on a given port, in charge of computing the actual offset from the shallow index file, in charge of the randomness, and handing back a few lines. A simple protocol could be: send an index and receive lines, or send a number of lines and receive that many. If you can arrange to have the same big data file on different partitions with different disk controllers, you could afford, say, one process per disk. The second (and main) process would be in charge of the analysis only. You could also implement recording of a session this way, and round-robin caching could be a nice optimization.
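The daemon side could look something like the sketch below. The command names LINE and RAND, the count-first reply framing and the sub names are hypothetical (the idea above only outlines the protocol), and $fetch stands for any code ref that returns the text of line n, for instance a shallow-index lookup.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical protocol, one request per line:
#   "LINE <n>"  reply with line n
#   "RAND <k>"  reply with k lines picked at random by the daemon
sub handle_request {
    my ($request, $fetch, $total_lines) = @_;
    if ($request =~ /^LINE\s+(\d+)\s*$/) {
        my $n = $1;
        return $n < $total_lines ? ($fetch->($n)) : ();
    }
    if ($request =~ /^RAND\s+(\d+)\s*$/) {
        my $k = $1;
        return map { $fetch->(int rand $total_lines) } 1 .. $k;
    }
    return;    # unknown command: empty reply
}

# Single-client-at-a-time server loop: for each request, send the number
# of lines in the reply, then the lines themselves.
sub serve {
    my ($port, $fetch, $total_lines) = @_;
    my $server = IO::Socket::INET->new(
        LocalPort => $port,
        Listen    => 5,
        ReuseAddr => 1,
        Proto     => 'tcp',
    ) or die "listen on port $port: $!";
    while (my $client = $server->accept) {
        while (defined(my $request = <$client>)) {
            my @lines = handle_request($request, $fetch, $total_lines);
            print {$client} scalar(@lines), "\n", @lines;
        }
        close $client;
    }
}
```

Keeping handle_request separate from the socket loop means the analysis process can also call it directly, and one such daemon per disk only differs in the $fetch it is given.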

cheers --stephan
