
If you're sure that an index holding the start offset of every line will fit in memory, then go for it. You can handle a lot of data that way -- just be sure you don't hit the nasty cases, such as a file containing only newlines, where the index ends up several times larger than the file itself.
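For what it's worth, here is a rough sketch of the offset-per-line index. The names build_line_index and get_line are made up for illustration, line numbers are 0-based, and the offsets are packed into a single string on the assumption that a plain Perl array of numbers would cost noticeably more memory per entry. Treat it as a starting point, not a finished implementation.

```perl
use strict;
use warnings;

# Width in bytes of one packed offset ('J' is Perl's native unsigned integer).
my $WIDTH = length(pack('J', 0));

# Build an index of the byte offset at which each line starts.
# The offsets live in one packed string, $WIDTH bytes per line.
sub build_line_index {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $index  = '';
    my $offset = 0;
    while (my $line = <$fh>) {
        $index .= pack('J', $offset);   # where this line starts
        $offset += length $line;        # where the next one will start
    }
    close $fh;
    return \$index;
}

# Fetch line $n (0-based) by seeking straight to its recorded offset.
# Returns undef if the file has no such line.
sub get_line {
    my ($path, $index_ref, $n) = @_;
    return if $n < 0 || ($n + 1) * $WIDTH > length($$index_ref);
    my $offset = unpack('J', substr($$index_ref, $n * $WIDTH, $WIDTH));
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    seek($fh, $offset, 0) or die "Cannot seek in $path: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}
```

With this layout every line costs $WIDTH bytes of index no matter how short it is, which is exactly why the all-newlines file hurts: each one-byte line gets a full-width entry.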

For that case, and the general case of even larger files, consider a variant of the index-every-nth-line idea: index based on disk block (or more likely, some multiple of the disk block). Say you use a block size of 8KB. Then, for each block, keep the line number of the first complete line starting within it. When seeking to a given line number, you do a binary search in your index to find the last block whose recorded line number is less than or equal to the line you're looking for. Then you read that block in and scan linearly through the text for the line you want.

This approach deals with the problematic cases more gracefully -- if you have a huge number of newlines, you'll still only read the block containing the line you want. (Well, you might have to read the following block too, to get the whole line.) Or, if you have enormous single lines, you'll never have the problem of your index giving you a starting position way before the line you want, as might happen if you were indexing every 25th line.
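A minimal sketch of the block-based index might look like the following. The names build_block_index and find_line are hypothetical, and line numbers are 0-based. One liberty taken here: each entry stores the byte offset of the first line starting in the block alongside its line number, so the lookup can seek straight to a line boundary. Blocks in which no line starts (the middle of an enormous line) simply get no entry.

```perl
use strict;
use warnings;

# Build one [line_number, byte_offset] entry per block in which a line starts.
sub build_block_index {
    my ($path, $block_size) = @_;
    $block_size ||= 8192;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @index;
    my $newlines      = 0;   # newlines seen before the current block
    my $base          = 0;   # byte offset of the current block
    my $at_line_start = 1;   # true if the block's first byte begins a line
    my $buf;
    while (read($fh, $buf, $block_size)) {
        my $first;           # position in $buf of the first line start, if any
        if ($at_line_start) {
            $first = 0;
        }
        else {
            my $nl = index($buf, "\n");
            $first = $nl + 1 if $nl >= 0 && $nl + 1 < length($buf);
        }
        if (defined $first) {
            my $line_no = $at_line_start ? $newlines : $newlines + 1;
            push @index, [ $line_no, $base + $first ];
        }
        $newlines     += ($buf =~ tr/\n//);            # count newlines in block
        $at_line_start = substr($buf, -1) eq "\n";
        $base         += length($buf);
    }
    close $fh;
    return \@index;
}

# Return line $target (0-based), or undef if the file has no such line.
sub find_line {
    my ($path, $index, $target) = @_;
    return if $target < 0 || !@$index;

    # Binary search for the last entry whose line number is <= $target.
    my ($lo, $hi) = (0, $#$index);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi + 1) / 2);
        if ($index->[$mid][0] <= $target) { $lo = $mid }
        else                              { $hi = $mid - 1 }
    }

    # Seek to that line boundary and scan forward linearly.
    my ($line_no, $offset) = @{ $index->[$lo] };
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    seek($fh, $offset, 0) or die "Cannot seek in $path: $!";
    while (defined(my $line = <$fh>)) {
        return $line if $line_no == $target;
        $line_no++;
    }
    return;   # ran off the end of the file; $fh closes when it goes out of scope
}
```

The index has at most one entry per block, so its size depends on the file's byte length and not on how many lines it holds. Every line skipped during the scan starts in the same block as the entry, so the scan never covers much more than one block before it reaches the target line.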

Generally speaking, your worst-case performance is defined in terms of the time to process some number of bytes, not lines, so you'll be better off if your index thinks in terms of bytes.

All that being said, I would guess that this would be overkill for your particular application, and you'd be better off with an offset-per-line index. It's simpler and good enough. And if that gets too big, you can always store the index in a gdbm file.

In reply to Re: Displaying/buffering huge text files by sfink
in thread Displaying/buffering huge text files by spurperl



	PerlMonks