PerlMonks  

Re: Help performing "random" access on very very large file

by sgt (Deacon)
on Jul 16, 2007 at 15:55 UTC ( [id://626868]=note )


in reply to Help performing "random" access on very very large file

An interesting question. Some slightly different ideas from the excellent ones already given.

Depending on the total number of accesses relative to the total number of lines (Dave's question, which you didn't really answer), I would build either a full index (ikegami's idea) or a shallow index.

With 500 GB and an average line of, say, 1000 bytes, you still get a huge number of lines: 500m (5*10^8). Offsets into a file that size do not fit in 4 bytes, so with 8-byte offsets a full index would be a 500m * 8 bytes = 4 GB file. A shallow index of, say, one entry every 50 lines would occupy only 4*10^9/50, i.e. 8*10^7 bytes = 80 MB, a reasonable number. To go to line n would mean seeking to position int(n / 50) * 8 in the index, reading the offset stored there, seeking to that offset in the data file, and then skipping n % 50 EOL markers (which essentially implements the Tie::File logic). A shallow index is interesting when the number of accesses is much smaller than the total number of lines in the main file.
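
A minimal sketch of the shallow-index idea, in case it helps to see it spelled out. The sub names, the step of 50, the 0-based line numbers and the on-disk format are all assumptions of mine, not something fixed by the thread. It stores 8-byte big-endian offsets (pack 'Q>', which needs a 64-bit perl, 5.10 or later), because a 4-byte offset cannot address a file larger than 4 GB.

```perl
use strict;
use warnings;

# Assumptions (illustrative only): one index entry every $STEP lines,
# each entry an 8-byte big-endian offset, line numbers starting at 0.
my $STEP       = 50;
my $ENTRY_SIZE = 8;

# Scan the data file once and record the byte offset of every $STEP-th line.
# Returns the total number of lines seen.
sub build_shallow_index {
    my ($data_path, $index_path) = @_;
    open my $in,  '<:raw', $data_path  or die "open $data_path: $!";
    open my $out, '>:raw', $index_path or die "open $index_path: $!";
    my ($line_no, $offset) = (0, 0);
    while (defined(my $line = <$in>)) {
        print {$out} pack('Q>', $offset) if $line_no % $STEP == 0;
        $offset += length $line;    # byte length, since the handle is :raw
        $line_no++;
    }
    close $out or die "close $index_path: $!";
    close $in;
    return $line_no;
}

# Fetch line $n: look up the nearest indexed line at or before it,
# seek there, then read forward over the remaining $n % $STEP lines.
# Returns nothing (undef in scalar context) if line $n does not exist.
sub fetch_line {
    my ($data_fh, $index_fh, $n) = @_;
    seek $index_fh, int($n / $STEP) * $ENTRY_SIZE, 0 or die "seek index: $!";
    my $got = read $index_fh, my $packed, $ENTRY_SIZE;
    return unless defined $got && $got == $ENTRY_SIZE;
    seek $data_fh, unpack('Q>', $packed), 0 or die "seek data: $!";
    my $line;
    for (0 .. $n % $STEP) {
        $line = <$data_fh>;
        return unless defined $line;
    }
    return $line;
}
```

Typical use would be to call build_shallow_index once, open both files with '<:raw', and then call fetch_line($data_fh, $index_fh, $n) for each access. Each lookup costs two seeks plus at most 50 line reads, which is the trade against the size of a full index.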

One other idea is to have a couple of processes (or more). One would be, say, a daemon listening on a given port, in charge of calculating the actual index from the shallow index file, in charge of randomness, and eventually giving back a few lines. A simple protocol could be: send an index and receive lines, or send a number of lines and receive that many. If you can arrange to have the same big data file on different partitions with different disk controllers, you could afford, say, one process per disk. The second (and main) process would be in charge of the analysis only. You could also implement recording of a session this way; round-robin caching could be a nice optimization.
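
Here is one way the line-serving daemon could look, as a rough sketch only. The request and reply formats ("LINE n", "RANDOM k", "OK count", "ERR message") are invented for illustration, and $fetch is a hypothetical callback that returns line n of the data file, for example a closure around whatever index lookup you use. The protocol logic is kept in a plain sub so it can be exercised without a socket.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical wire protocol (an assumption, the idea is only outlined above):
#   "LINE <n>\n"    -> "OK 1\n" followed by line n
#   "RANDOM <k>\n"  -> "OK <k>\n" followed by k lines picked by the server
#   anything else   -> "ERR <message>\n"
sub handle_request {
    my ($request, $fetch, $total_lines) = @_;
    my @wanted;
    if ($request =~ /^LINE (\d+)\s*$/) {
        @wanted = ($1);
    }
    elsif ($request =~ /^RANDOM (\d+)\s*$/) {
        # The server owns the randomness, so clients never see line numbers.
        @wanted = map { int(rand($total_lines)) } 1 .. $1;
    }
    else {
        return "ERR bad request\n";
    }
    my $body = '';
    for my $n (@wanted) {
        my $line = $n < $total_lines ? $fetch->($n) : undef;
        return "ERR no such line $n\n" unless defined $line;
        $line .= "\n" unless $line =~ /\n\z/;
        $body .= $line;
    }
    return 'OK ' . scalar(@wanted) . "\n" . $body;
}

# Accept one client at a time on localhost and answer requests line by line.
sub serve {
    my ($port, $fetch, $total_lines) = @_;
    my $server = IO::Socket::INET->new(
        LocalAddr => '127.0.0.1',
        LocalPort => $port,
        Listen    => 5,
        ReuseAddr => 1,
    ) or die "cannot listen on port $port: $@";
    while (my $client = $server->accept) {
        while (defined(my $request = <$client>)) {
            print {$client} handle_request($request, $fetch, $total_lines);
        }
        close $client;
    }
}
```

Running one such server per disk would only need a different port and a $fetch bound to that disk's copy of the file. The analysis process then just writes requests and reads back the announced number of lines.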

cheers --stephan