http://qs321.pair.com?node_id=433957


in reply to Displaying/buffering huge text files

If you're sure that indexing the start offset of every line will fit in memory, then go for it. You can handle a lot of data that way -- but just be sure you don't hit the nasty cases, such as a file containing only newlines.
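For concreteness, here is a minimal sketch of the offset-per-line index, in Python rather than Perl. The names `build_line_offsets` and `read_line_at` are made up for illustration, and storing the offsets in an `array("Q")` (8 bytes per line) is my assumption, not something from the original question.

```python
from array import array

def build_line_offsets(path):
    """Return an array whose entry i is the byte offset where line i (0-based) starts."""
    offsets = array("Q")  # unsigned 64-bit integers: 8 bytes of index per line
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def read_line_at(path, offsets, lineno):
    """Seek straight to the start of line `lineno` and read it."""
    with open(path, "rb") as f:
        f.seek(offsets[lineno])
        return f.readline()
```

With 8 bytes of index per line, a file containing only newlines needs an index eight times the size of the file itself, which is why that case is the nasty one.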

For that case, and the general case of even larger files, consider a variant of the index-every-nth-line idea: index by disk block (or more likely, some multiple of the disk block size). Say you use a block size of 8KB. For each block, store the line number of the first line that starts within that block. To seek to a given line number, do a binary search in your index for the last block whose stored line number is less than or equal to the line you're looking for. Then read that block in and scan linearly through the text for the line you want.

This approach deals with the problematic cases more gracefully -- if you have a huge number of newlines, you'll still only read the block containing the line you want. (Well, you might have to read the following block too, to get the whole line.) Or, if you have enormous single lines, you'll never have the problem of your index giving you a starting position way before the line you want, as might happen if you were indexing every 25th line.
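A rough sketch of the block-based index, again in Python. The function names and the tiny default arguments are illustrative only. One detail I had to decide that the description above leaves open: a block in which no line starts (the middle of an enormous line) stores the number of the next line to start, which keeps the index sorted so the binary search still works.

```python
import bisect

def build_block_index(path, block_size=8192):
    """Entry b is the number (0-based) of the first line starting at or after
    byte b * block_size, i.e. the count of lines that started before the block."""
    first_line = []
    lines_started = 0
    starts_here = True  # byte 0 always starts a line
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            first_line.append(lines_started)
            # A newline anywhere but the last byte starts a line inside this
            # block; a newline in the last byte starts one in the next block.
            lines_started += int(starts_here) + chunk.count(b"\n", 0, len(chunk) - 1)
            starts_here = chunk.endswith(b"\n")
    return first_line

def read_line_via_blocks(path, first_line, lineno, block_size=8192):
    """Return line `lineno` (0-based), or b"" if the file has no such line."""
    if not first_line:
        return b""
    # Last block whose stored line number is <= lineno.
    block = bisect.bisect_right(first_line, lineno) - 1
    with open(path, "rb") as f:
        if block > 0:
            # Start one byte early so readline() swallows the tail of any
            # line that straddles the block boundary.
            f.seek(block * block_size - 1)
            f.readline()
        for _ in range(lineno - first_line[block]):
            f.readline()  # skip the lines in front of the one we want
        return f.readline()
```

The index costs one integer per block no matter what the file contains, so its size depends on the byte count rather than the line count.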

Generally speaking, your worst-case performance is defined in terms of the time to process some number of bytes, not lines, so you'll be better off if your index thinks in terms of bytes.

All that being said, I would guess that this would be overkill for your particular application, and you'd be better off with an offset-per-line index. It's simpler and good enough. And if that gets too big, you can always store the index in a gdbm file.


Re^2: Displaying/buffering huge text files
by scooper (Novice) on Feb 24, 2005 at 23:43 UTC
    sfink: You can handle a lot of data that way -- but just be sure you don't hit the nasty cases, such as a file containing only newlines.

    A file containing *only* newlines is not a nasty case and requires no indexing at all. If you can figure out before you read it that the file contains only newlines, you change your "figure out the seek offset" subroutine so that to get to line 120000 (counting lines from zero) it seeks to byte 120000. It doesn't get any easier!
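A small sketch of that shortcut in Python. The names are hypothetical, and the check here reads the whole file once, since I don't know of a general way to tell a file is all newlines without looking at it. Lines are counted from zero, so line n starts at byte n; with 1-based numbering it would be byte n - 1.

```python
def is_only_newlines(path, chunk_size=1 << 16):
    """True if every byte in the file is a newline (an empty file counts too)."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return True
            if chunk.count(b"\n") != len(chunk):
                return False

def newline_only_offset(lineno):
    """In a newline-only file, 0-based line n starts at byte n."""
    return lineno
```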