
How to get fast random access to a large file?

by gothic_mallard (Pilgrim)
on Oct 29, 2004 at 11:51 UTC

gothic_mallard has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I'm attempting to find a fast way to manipulate pretty large files (well, anything from about 100KB to 2GB).

As a quick run down - the files themselves contain a mini-markup language for driving laser printers. Each line in the file (\n delimited - MS Windows) is a separate instruction. The lines are grouped into the commands that create a specific page, and the pages are then grouped into sets of related pages. (These all get represented by objects that cache the data as it's discovered and make data extraction easier.)

To cut a long story short, I need a way of navigating around the file in as efficient and speedy a manner as possible (speed is probably more of a consideration than efficiency (memory usage et al) in this case).

Currently I'm using Tie::File but I'm not sure if this is the best way. The problem is that if I want a line near the start of the file it gets returned pretty quickly, but if it's near the end it takes a fair amount of time.

I was thinking about IO::File, but then to be able to get a line directly I'd need to index the file first (otherwise I don't know where to seek to, since the lines are all variable in length).

There are a few likely looking modules on CPAN but never having used them I'm not familiar with their strengths / weaknesses so I'd value some opinions.

Any code that can read the file also needs to be able to write to it so that the file can be amended - currently this gets done by hand in something like UltraEdit and is fairly clunky, so I'm hoping what I'm developing will take some of the pain out of it :)

If I haven't covered something here adequately enough just let me know and I'll try to clarify :)

This is all based on MS Windows 2000/XP desktops and servers running ActivePerl 5.6.1 (build 633).

Thanks in advance,

Quick aside:
Just wondering if there's any reason why all my replies just got downvoted? :-?

Thanks all for the advice so far. Sticking with Tie::File looks like it means getting into some kind of indexing. Is Tie::File the best solution here though (short of reading the thing into a db, which I would if I could :)) or are there modules out there more suited to the task? I saw File::RandomAccess but it doesn't appear to be available via ActiveState PPM, so it would be a nightmare getting it onto machines here.

--- Jay

All code is untested unless otherwise stated.
All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
If in doubt ask.

Replies are listed 'Best First'.
Re: How to get fast random access to a large file?
by BrowserUk (Patriarch) on Oct 29, 2004 at 12:46 UTC

    Are you modifying the file?

    Have you tried setting the memory parameter when you tie the file? The default is 20MB; increasing this according to how much RAM you have may improve performance.
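    A minimal sketch of that (untested; the file name and the 200 MB figure are only illustrative):

        use strict;
        use warnings;
        use Tie::File;

        # Let Tie::File cache far more offset/record data than its default
        # limit before it has to start forgetting what it already discovered.
        tie my @line, 'Tie::File', 'printjob.txt',
            memory => 200_000_000
                or die "Cannot tie printjob.txt: $!";

        print $line[1_000_000], "\n";   # random access by line number
        untie @line;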

    The thing you have to remember is that in order to read the last line of a variable length record file, you *have* to read all the intermediate ones along the way. At least the first time. After that T::F will remember where the lines are, provided remembering doesn't require more than the memory limit specified. Once that memory limit is exhausted, it has to start forgetting things, which then requires re-discovery if you revisit those forgotten lines later.

    It takes 128 MB of raw binary storage to remember the offsets of all 33,554,432 32-character lines in a 1 GB file. That's storing the offsets as 4-byte binary values. Tie::File uses a hash to store the offsets, which requires considerably more memory. All of which is my way of saying, Tie::File is very good, but it can't work miracles; and if you are working on files bigger than a couple of hundred MB, you must increase the memory parameter value.

    If you are modifying the lines, that will slow things down. A lot if you are modifying randomly throughout the file.

    Also, you can construct your own index file for the record offsets quite easily. It means you can use substantially less RAM for the index overhead and still achieve very fast random access. It takes a bit of work, but if you're interested /msg me.
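    A rough sketch of the idea - one pass with tell to record each line's byte offset, packed as 4-byte integers into a side file, so fetching line N later is just two seeks (untested; file names illustrative):

        use strict;
        use warnings;

        my ($data, $index) = ('printjob.txt', 'printjob.idx');

        # Pass 1: store the byte offset of every line as a 4-byte big-endian integer.
        open my $in,  '<', $data  or die "Cannot open $data: $!";
        open my $idx, '>', $index or die "Cannot create $index: $!";
        binmode $idx;
        while (1) {
            my $offset = tell $in;
            defined( my $line = <$in> ) or last;
            print {$idx} pack 'N', $offset;
        }
        close $idx;

        # Later: fetch line $n (0-based) without re-scanning the data file.
        open my $ixfh, '<', $index or die "Cannot open $index: $!";
        binmode $ixfh;

        sub get_line {
            my ($fh, $ixfh, $n) = @_;
            seek $ixfh, $n * 4, 0     or return;
            read $ixfh, my $packed, 4 or return;
            seek $fh, unpack('N', $packed), 0;
            return scalar <$fh>;
        }

        print get_line($in, $ixfh, 12_345);   # e.g. line 12345 via two seeks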


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: How to get fast random access to a large file?
by tilly (Archbishop) on Oct 29, 2004 at 15:55 UTC
    Your edits are guaranteed to be slow.

    If you make an edit near the middle that changes the length of a line, the rest of the file has to be rewritten. It takes time to write a GB of data, and nothing you can do will change that fact. If the files in question are on a shared drive this will be slow, and you'll want to be very careful about locking issues or else you could find yourself losing edits.

    I'd strongly suggest looking at the file format and deciding whether you can find some kind of "filler" to even things out. For instance, maybe the format allows for comments somewhere. That would let you rewrite the file once and then deal with it as a fixed-record-length format afterwards (sketched below). Even with the complexity of having to sanity check that lines appear to start where they should (someone might edit by hand...), this would make your life amazingly easier.
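    A sketch of what that buys you, assuming a hypothetical 320-byte record plus the two-byte CRLF terminator (untested):

        use strict;
        use warnings;

        my $reclen = 322;    # 320 data bytes + CRLF - purely illustrative
        open my $fh, '+<', 'printjob.txt' or die "Cannot open printjob.txt: $!";
        binmode $fh;         # keep the byte arithmetic honest on Windows

        # Fetch record $n with a single seek.
        sub read_record {
            my ($fh, $n) = @_;
            seek $fh, $n * $reclen, 0 or die "seek failed: $!";
            read $fh, my $rec, $reclen;
            return $rec;
        }

        # Overwrite record $n in place - only safe because every record is the same length.
        sub write_record {
            my ($fh, $n, $rec) = @_;
            die "record is the wrong length" unless length($rec) == $reclen;
            seek $fh, $n * $reclen, 0 or die "seek failed: $!";
            print {$fh} $rec;
        }

        my $rec = read_record($fh, 5_000);   # the 5001st record, instantly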

    If you can't do that, then any solution that you come up with will invariably suck. But it won't be your fault, it will be a result of the artificial limitations that you have to live under. Not that that will make you feel much better when people complain...

Re: How to get fast random access to a large file?
by Happy-the-monk (Canon) on Oct 29, 2004 at 12:17 UTC

      Yep, tried that one :) I'm sure I can tweak Tie::File to an extent but just wondering if there are any other ways?

      --- Jay

      All code is untested unless otherwise stated.
      All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
      If in doubt ask.

Re: How to get fast random access to a large file?
by dragonchild (Archbishop) on Oct 29, 2004 at 12:40 UTC
    This sounds like you might want to look at a database. You could have a table that looked something like:
    Column        Type
    ID            INT
    Instruction   VARCHAR
    Next          INT

    So, editing the files isn't going to be an issue because you have a linked list of instructions. Pulling the list of instructions out is going to be a little more annoying, but you shouldn't have to do that very often.
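    A first-pass sketch of that schema using DBD::SQLite, which keeps the database in a plain file so no server is needed (untested; table and file names are only placeholders):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:SQLite:dbname=instructions.db', '', '',
                               { RaiseError => 1 });

        $dbh->do(q{
            CREATE TABLE instruction (
                id          INTEGER PRIMARY KEY,
                instruction VARCHAR(400),
                next        INTEGER          -- id of the following instruction
            )
        });

        # Load each line of the print file as one row, linked to the next.
        open my $fh, '<', 'job.txt' or die "Cannot open job.txt: $!";
        chomp( my @lines = <$fh> );
        close $fh;

        my $sth = $dbh->prepare(
            'INSERT INTO instruction (id, instruction, next) VALUES (?, ?, ?)');
        for my $i (0 .. $#lines) {
            my $next = $i < $#lines ? $i + 2 : undef;   # NULL marks the last line
            $sth->execute($i + 1, $lines[$i], $next);
        }
        $dbh->disconnect;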

    Of course, this is just a first-pass at the problem. A few more discussions and we can have a better schema for you.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      I really need to leave the files in-situ and I'm stuck with the format I've got, unfortunately. I also can't guarantee that everyone using the utility that will come from this development will have access to a database (I don't have control over that sort of thing here, and they're tight enough about access - most people's machines here don't even let them change the desktop wallpaper, let alone have a connection to one of the database servers...)

      --- Jay

      All code is untested unless otherwise stated.
      All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
      If in doubt ask.

Re: How to get fast random access to a large file?
by fglock (Vicar) on Oct 29, 2004 at 12:42 UTC

    Depending on your application, you could use fixed length records. This format can be edited in a text editor (carefully) and it also provides random access:

    some text # more lines # a very big # line split # into three #

    Another possible markup for overflowing lines:

    some text # more lines # a very big \ line split \ into three #

      Unfortunately I don't have any control over the structure of the data file as it's autogenerated by software written by someone else and used by other systems.

      The content I can change (like, if it says to print "Mr J Bloggs", change it to "Mrs A Nonimouse") but I can't change the representation.

      --- Jay

      All code is untested unless otherwise stated.
      All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
      If in doubt ask.

Re: How to get fast random access to a large file?
by PodMaster (Abbot) on Oct 29, 2004 at 12:43 UTC
    Change the file format, maintain an index. Add a header, something like
    Line Offsets:      20, 55, 66, 99 ... bytes
    Page Offset/Size:  1-4, 5-9 ... lines
    Sets:              1-2-3, 4-5-6 ... pages
    If all you're interested in is navigation, this might be enough.

    If the file doesn't need to stay hand editable, and you're interested in manipulation, I'd switch to a database like BerkeleyDB or DBD::SQLite, depending on whether or not SQL is overkill.
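    If SQL is overkill, about the simplest version of the database route is a tied DB_File hash (the Berkeley DB binding) keyed by line number - a sketch, with the file name and record purely illustrative:

        use strict;
        use warnings;
        use DB_File;
        use Fcntl qw(O_RDWR O_CREAT);

        tie my %line, 'DB_File', 'printjob.db', O_RDWR|O_CREAT, 0666, $DB_HASH
            or die "Cannot tie printjob.db: $!";

        $line{42} = 'PRINT 000100 000200 Mr J Bloggs';   # store/update by line number
        print $line{42}, "\n";                           # fetch by line number
        untie %line;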

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      As I replied to fglock, I can't change the file's structure, only its data. i.e. I can add/remove/change lines but I have no control over "how" they're represented.

      Each line is a fixed length record in itself, but each record type has a different structure. Say, a "print here" may be:

      x pos (6 char) y pos (6 char) string (300 char)

      and a "new sheet" may be

      sheet number (4 char) stock code (10 char)

      and so on (those aren't real structures above but similar to the real thing).
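      Just to illustrate, once I know which record type a line is, something like unpack would pull the fields apart (widths here are made up, as above):

          use strict;
          use warnings;

          # A pretend "print here" record: 6-char x pos, 6-char y pos, 300-char string.
          my $line = '000100' . '000200' . sprintf('%-300s', 'Mr J Bloggs');

          my ($x, $y, $string) = unpack 'A6 A6 A300', $line;
          print "$x / $y / $string\n";    # 000100 / 000200 / Mr J Bloggs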

      I was toying with the idea of creating index files, but that comes with the overhead of having to parse the original file first to create them (if I could do that quickly it wouldn't be so much of a problem ;-)).

      --- Jay

      All code is untested unless otherwise stated.
      All opinions expressed are my own and are intended as guidance, not gospel; please treat what I say as such and as Abigail said Think for yourself.
      If in doubt ask.

Re: How to get fast random access to a large file?
by Anonymous Monk on Oct 30, 2004 at 00:42 UTC
    Windows text files have an interesting property: when read in text mode, they end wherever the character "\x1A" is encountered. You could add this character and then the header that PodMaster hinted at. I would also add the offset of the "\x1A" at the very end of the file, to check whether someone has edited the file without regard for your header, or whether it is newly generated and doesn't contain a header yet. BTW, access to the last part of a file is slow on all FAT filesystems.
Re: How to get fast random access to a large file?
by graff (Chancellor) on Oct 31, 2004 at 04:39 UTC
    ... I need a method of being able to navigate around the file in as efficient and speedy a manner as possible... Any code that can read the file also needs to be able to write to it so that the file may be amended - currently this gets done by hand...

    And are you planning for the navigation to be done by hand as well (i.e. interactively: the user starts the program, then carries on some sort of dialog to get to a point of interest, make changes if needed, jump to another point, make changes, and so on, until a final save/exit)?

    Given that you can't redesign the file structure, that the files get to be up to 2 GB, and that the navigation/updates are to be controlled manually, my next question would be: is this just a one-shot or occasional process, or is it rather something that will be a continuing need, such that some extra code development -- and some extra cpu cycles the first time a given file is processed -- is justified?

    If you need to handle a lot of files this way, and especially if you need to revisit any given file numerous times, it will be worth your while to use a database (e.g. mysql, which is easy to install if you don't have it already -- and installing the Perl DBI and DBD::mysql shouldn't be much trouble either).

    Whatever creates the files in the first place does not need to change, and whatever uses the files after your seek/edit process is done can likewise remain unchanged. All you need is a front-end process to load a file into the database, a user interface to navigate and update the database records for a given file, and a back-end process to dump the database contents to a new file. The front- and back-end processes will run pretty quickly, and the UI can be optimized in any number of ways to be very quick and reliable, depending on what sort of information is needed to navigate.
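    The front-end loader really can be that small; a rough sketch with DBI and DBD::mysql, assuming the instructions table already exists (connection details, table and column names are only placeholders):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:mysql:database=printjobs;host=localhost',
                               'user', 'password', { RaiseError => 1 });

        my $ins = $dbh->prepare(
            'INSERT INTO instructions (file, line_no, instruction) VALUES (?, ?, ?)');

        open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
        my $n = 0;
        while (my $line = <$fh>) {
            chomp $line;
            $ins->execute($ARGV[0], ++$n, $line);   # one row per instruction line
        }
        close $fh;
        $dbh->disconnect;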
