
How to get fast random access to a large file?

by gothic_mallard (Pilgrim)
on Oct 29, 2004 at 11:51 UTC

gothic_mallard has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I'm attempting to find a fast way to manipulate pretty large files (well, anything from about 100KB to 2GB).

As a quick run down - the files themselves contain a mini-markup language for driving laser printers. Each line in the file (\n delimited - MS Windows) is a separate instruction. The lines are grouped into the commands that create a specific page, and the pages are then grouped into sets of related pages. (These all get represented by objects that cache the data as it's discovered and make data extraction easier.)

To cut a long story short, I need a way of navigating around the file in as efficient and speedy a manner as possible (speed is probably more of a consideration than efficiency (memory usage et al) in this case).

Currently I'm using Tie::File but I'm not sure if this is the best way. The problem is that if I want a line near the start of the file it gets returned pretty quickly, but if it's near the end it takes a fair amount of time.

I was thinking about IO::File, but then to be able to get a line directly I'd need to index the file first (otherwise I don't know where to seek to, since the lines are all variable in length).

There are a few likely looking modules on CPAN but never having used them I'm not familiar with their strengths / weaknesses so I'd value some opinions.

Any code that can read the file also needs to be able to write to it so that the file can be amended - currently this gets done by hand in something like UltraEdit and is fairly clunky, so I'm hoping what I'm developing will take some of the pain out of it :)

If I haven't covered something here adequately enough just let me know and I'll try to clarify :)

This is all based on MS Windows 2000/XP desktops and servers running ActivePerl 5.6.1 (build 633).

Thanks in advance,

Quick aside:
Just wondering if there's any reason why all my replies just got downvoted? :-?

Thanks all for the advice so far. Sticking with Tie::File looks like it means getting into some kind of indexing. Is Tie::File the best solution here though (short of reading the thing into a db, which I would if I could :)) or are there modules out there more suited to the task? I saw File::RandomAccess but it doesn't appear to be available via ActiveState PPM, so it would be a nightmare getting it onto machines here.

--- Jay

All code is untested unless otherwise stated.
All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
If in doubt ask.

Replies are listed 'Best First'.
Re: How to get fast random access to a large file?
by BrowserUk (Patriarch) on Oct 29, 2004 at 12:46 UTC

    Are you modifying the file?

    Have you tried setting the memory parameter when you tie the file? The default is 20MB; increasing this according to how much RAM you have may improve performance.
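    A minimal sketch of that (untested; the file name and the 200 MB figure are only illustrative):

        use strict;
        use warnings;
        use Tie::File;

        # Let Tie::File cache far more offset/record data than its default
        # limit before it has to start forgetting what it already discovered.
        tie my @line, 'Tie::File', 'printjob.txt',
            memory => 200_000_000
                or die "Cannot tie printjob.txt: $!";

        print $line[1_000_000], "\n";   # random access by line number
        untie @line;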

    The thing you have to remember is that in order to read the last line of a variable length record file, you *have* to read all the intermediate ones along the way. At least the first time. After that T::F will remember where the lines are, provided remembering doesn't require more than the memory limit specified. Once that memory limit is exhausted, it has to start forgetting things, which then requires re-discovery if you revisit those forgotten lines later.

    It takes 128 MB of raw binary storage to remember the offsets of all 33,554,432 32-character lines in a 1 GB file. That's storing the offsets as 4-byte binary values. Tie::File uses a hash to store the offsets, which requires considerably more memory. All of which is my way of saying, Tie::File is very good, but it can't work miracles; and if you are working on files bigger than a couple of hundred MB, you must increase the memory parameter value.

    If you are modifying the lines, that will slow things down. A lot if you are modifying randomly throughout the file.

    Also, you can construct your own index file for the record offsets quite easily. It means you can use substantially less RAM for the index overhead and still achieve very fast random access. It takes a bit of work, but if you're interested /msg me.
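    A rough sketch of the idea - one pass with tell to record each line's byte offset, packed as 4-byte integers into a side file, so fetching line N later is just two seeks (untested; file names illustrative):

        use strict;
        use warnings;

        my ($data, $index) = ('printjob.txt', 'printjob.idx');

        # Pass 1: store the byte offset of every line as a 4-byte big-endian integer.
        open my $in,  '<', $data  or die "Cannot open $data: $!";
        open my $idx, '>', $index or die "Cannot create $index: $!";
        binmode $idx;
        while (1) {
            my $offset = tell $in;
            defined( my $line = <$in> ) or last;
            print {$idx} pack 'N', $offset;
        }
        close $idx;

        # Later: fetch line $n (0-based) without re-scanning the data file.
        open my $ixfh, '<', $index or die "Cannot open $index: $!";
        binmode $ixfh;

        sub get_line {
            my ($fh, $ixfh, $n) = @_;
            seek $ixfh, $n * 4, 0     or return;
            read $ixfh, my $packed, 4 or return;
            seek $fh, unpack('N', $packed), 0;
            return scalar <$fh>;
        }

        print get_line($in, $ixfh, 12_345);   # e.g. line 12345 via two seeks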


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: How to get fast random access to a large file?
by tilly (Archbishop) on Oct 29, 2004 at 15:55 UTC
    Your edits are guaranteed to be slow.

    If you make an edit near the middle that changes the length of a line, the rest of the file has to be rewritten. It takes time to write a GB of data, and nothing you can do will change that fact. If the files in question are on a shared drive this will be slow, and you'll want to be very careful about locking issues or else you could find yourself losing edits.

    I'd strongly suggest looking at the file format and deciding whether you can find some kind of "filler" to even things out. For instance, maybe the format allows for comments somewhere. That would let you rewrite the file once and then deal with it as a fixed-record-length format afterwards (sketched below). Even with the complexity of having to sanity check that lines appear to start where they should (someone might edit by hand...), this would make your life amazingly easier.
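    A sketch of what that buys you, assuming a hypothetical 320-byte record plus the two-byte CRLF terminator (untested):

        use strict;
        use warnings;

        my $reclen = 322;    # 320 data bytes + CRLF - purely illustrative
        open my $fh, '+<', 'printjob.txt' or die "Cannot open printjob.txt: $!";
        binmode $fh;         # keep the byte arithmetic honest on Windows

        # Fetch record $n with a single seek.
        sub read_record {
            my ($fh, $n) = @_;
            seek $fh, $n * $reclen, 0 or die "seek failed: $!";
            read $fh, my $rec, $reclen;
            return $rec;
        }

        # Overwrite record $n in place - only safe because every record is the same length.
        sub write_record {
            my ($fh, $n, $rec) = @_;
            die "record is the wrong length" unless length($rec) == $reclen;
            seek $fh, $n * $reclen, 0 or die "seek failed: $!";
            print {$fh} $rec;
        }

        my $rec = read_record($fh, 5_000);   # the 5001st record, instantly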

    If you can't do that, then any solution that you come up with will invariably suck. But it won't be your fault, it will be a result of the artificial limitations that you have to live under. Not that that will make you feel much better when people complain...

Re: How to get fast random access to a large file?
by Happy-the-monk (Canon) on Oct 29, 2004 at 12:17 UTC

      Yep, tried that one :) I'm sure I can tweak Tie::File to an extent but just wondering if there are any other ways?

      --- Jay

      All code is untested unless otherwise stated.
      All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
      If in doubt ask.

Re: How to get fast random access to a large file?
by dragonchild (Archbishop) on Oct 29, 2004 at 12:40 UTC
    This sounds like you might want to look at a database. You could have a table that looked something like:
    Column        Type
    ID            INT
    Instruction   VARCHAR
    Next          INT

    So, editing the files isn't going to be an issue because you have a linked list of instructions. Pulling the list of instructions out is going to be a little more annoying, but you shouldn't have to do that very often.
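    A first-pass sketch of that schema using DBD::SQLite, which keeps the database in a plain file so no server is needed (untested; table and file names are only placeholders):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=instructions.db', '', '',
                               { RaiseError => 1 });

        $dbh->do(q{
            CREATE TABLE instruction (
                id          INTEGER PRIMARY KEY,
                instruction VARCHAR(400),
                next        INTEGER          -- id of the following instruction
            )
        });

        # Load each line of the print file as one row, linked to the next.
        open my $fh, '<', 'job.txt' or die "Cannot open job.txt: $!";
        chomp( my @lines = <$fh> );
        close $fh;

        my $sth = $dbh->prepare(
            'INSERT INTO instruction (id, instruction, next) VALUES (?, ?, ?)');
        for my $i (0 .. $#lines) {
            my $next = $i < $#lines ? $i + 2 : undef;   # NULL marks the last line
            $sth->execute($i + 1, $lines[$i], $next);
        }
        $dbh->disconnect;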

    Of course, this is just a first-pass at the problem. A few more discussions and we can have a better schema for you.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      I really need to leave the files in-situ and I'm stuck with the format I've got, unfortunately. I also can't guarantee that everyone using the utility that will come from this development will have access to a database (I don't have control over that sort of thing here, and they're tight enough about access - most people's machines here don't even let them change the desktop wallpaper, let alone have a connection to one of the database servers...)

      --- Jay

      All code is untested unless otherwise stated.
      All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
      If in doubt ask.

Re: How to get fast random access to a large file?
by fglock (Vicar) on Oct 29, 2004 at 12:42 UTC

    Depending on your application, you could use fixed length records. This format can be edited in a text editor (carefully) and it also provides random access:

    some text # more lines # a very big # line split # into three #

    Another possible markup for overflowing lines:

    some text # more lines # a very big \ line split \ into three #

      Unfortunately I don't have any control over the structure of the data file as it's autogenerated by software written by someone else and used by other systems.

      The content I can change (like, if it says to print "Mr J Bloggs", change it to "Mrs A Nonimouse") but I can't change the representation.

      --- Jay

      All code is untested unless otherwise stated.
      All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
      If in doubt ask.

Re: How to get fast random access to a large file?
by PodMaster (Abbot) on Oct 29, 2004 at 12:43 UTC
    Change the file format, maintain an index. Add a header, something like
    Line Offsets:      20, 55, 66, 99 ... bytes
    Page Offset/Size:  1-4, 5-9 ... lines
    Sets:              1-2-3, 4-5-6 ... pages
    If all you're interested in is navigation, this might be enough.

    If the file doesn't need to stay hand editable, and you're interested in manipulation, I'd switch to a database like BerkeleyDB or DBD::SQLite, depending on whether or not SQL is overkill.
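    If SQL is overkill, about the simplest version of the database route is a tied DB_File hash (the Berkeley DB binding) keyed by line number - a sketch, with the file name and record purely illustrative:

        use strict;
        use warnings;
        use DB_File;
        use Fcntl qw(O_RDWR O_CREAT);

        tie my %line, 'DB_File', 'printjob.db', O_RDWR|O_CREAT, 0666, $DB_HASH
            or die "Cannot tie printjob.db: $!";

        $line{42} = 'PRINT 000100 000200 Mr J Bloggs';   # store/update by line number
        print $line{42}, "\n";                           # fetch by line number
        untie %line;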

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      As I replied to fglock, I can't change the file's structure, only its data. i.e. I can add/remove/change lines but I have no control over "how" they're represented.

      Each line is a fixed length record in itself, but each record type has a different structure. Say, a "print here" may be:

      x pos (6 char) y pos (6 char) string (300 char)

      and a "new sheet" may be

      sheet number (4 char) stock code (10 char)

      and so on (those aren't real structures above but similar to the real thing).
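      Just to illustrate, once I know which record type a line is, something like unpack would pull the fields apart (widths here are made up, as above):

          use strict;
          use warnings;

          # A pretend "print here" record: 6-char x pos, 6-char y pos, 300-char string.
          my $line = '000100' . '000200' . sprintf('%-300s', 'Mr J Bloggs');

          my ($x, $y, $string) = unpack 'A6 A6 A300', $line;
          print "$x / $y / $string\n";    # 000100 / 000200 / Mr J Bloggs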

      I was toying with the idea of creating index files, but that comes with the overhead of having to parse the original file first to create them (if I could do that quickly it wouldn't be so much of a problem ;-)).

      --- Jay

      All code is untested unless otherwise stated.
      All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
      If in doubt ask.

Re: How to get fast random access to a large file?
by Anonymous Monk on Oct 30, 2004 at 00:42 UTC
    Windows text files have an interesting property: when read in text mode, they end wherever the character "\x1A" is encountered. You could add this character and then the header that PodMaster hinted at. I would also add the offset of the "\x1A" at the very end of the file, to check whether someone has edited the file without regard for your header, or whether it is newly generated and doesn't contain a header yet. BTW, access to the last part of a file is slow on all FAT filesystems.
Re: How to get fast random access to a large file?
by graff (Chancellor) on Oct 31, 2004 at 04:39 UTC
    ... I need a method of being able to navigate around the file in as efficient and speedy a manner as possible... Any code that can read the file also needs to be able to write to it so that the file may be amended - currently this gets done by hand...

    And are you planning for the navigation to be done by hand as well (i.e. interactively: the user starts the program, then carries on some sort of dialog to get to a point of interest, make changes if needed, jump to another point, make changes, and so on, until a final save/exit)?

    Given that you can't redesign the file structure, that the files get to be up to 2 GB, and that the navigation/updates are to be controlled manually, my next question would be: is this just a one-shot or occasional process, or is it rather something that will be a continuing need, such that some extra code development -- and some extra cpu cycles the first time a given file is processed -- is justified?

    If you need to handle a lot of files this way, and especially if you need to revisit any given file numerous times, it will be worth your while to use a database (e.g. mysql, which is easy to install if you don't have it already -- and installing the Perl DBI and DBD::mysql shouldn't be much trouble either).

    Whatever creates the files in the first place does not need to change, and whatever uses the files after your seek/edit process is done can likewise remain unchanged. All you need is a front-end process to load a file into the database, a user interface to navigate and update the database records for a given file, and a back-end process to dump the database contents to a new file. The front- and back-end processes will run pretty quickly, and the UI can be optimized in any number of ways to be very quick and reliable, depending on what sort of information is needed to navigate.
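    The front-end loader really can be that small; a rough sketch with DBI and DBD::mysql, assuming the instructions table already exists (connection details, table and column names are only placeholders):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=printjobs;host=localhost',
                               'user', 'password', { RaiseError => 1 });

        my $ins = $dbh->prepare(
            'INSERT INTO instructions (file, line_no, instruction) VALUES (?, ?, ?)');

        open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
        my $n = 0;
        while (my $line = <$fh>) {
            chomp $line;
            $ins->execute($ARGV[0], ++$n, $line);   # one row per instruction line
        }
        close $fh;
        $dbh->disconnect;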
