
The idea of storing only every n-th line in the index table, which a couple of monks brought up, works even better than expected. I wasn't sure about it at first, but since my reading of the file is buffered anyway, it imposes no performance cost.

My approach now is: I have a BUFFER_BLOCK, currently 1000 lines long. I store the position of every BUFFER_BLOCK-th line in the index, that is, line 0, line 1000, etc. When the class is asked for a line that is not in the buffer, it rounds the line number down to the nearest multiple of BUFFER_BLOCK (i.e. for line 7781 it goes to 7000), grabs one more BUFFER_BLOCK of lines below that and one more above (that is, lines 6000-9000 for line 7781), and returns the desired line.
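To make the lookup concrete, here is a minimal C++ sketch of the windowing step. It is not my actual class: the names Window, window_for and BufferedLines are made up for illustration, and it assumes the index already holds one byte offset per BUFFER_BLOCK lines and that the stream is seekable.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this in memory
#include <string>
#include <utility>
#include <vector>

// Half-open range of line numbers [first, last) held in the buffer.
struct Window {
    std::size_t first;
    std::size_t last;
};

// Round `line` down to a multiple of `block`, then extend the window one
// block below and two blocks above that base (7781 -> 6000..9000).
Window window_for(std::size_t line, std::size_t block) {
    assert(block > 0);
    std::size_t base = (line / block) * block;
    std::size_t first = base >= block ? base - block : 0;
    return {first, base + 2 * block};
}

// Illustrative class (assumed design): keeps one window of lines in memory
// and reloads it whenever a requested line falls outside the window.
class BufferedLines {
public:
    BufferedLines(std::istream& in, std::vector<std::streamoff> index,
                  std::size_t block)
        : in_(in), index_(std::move(index)), block_(block), first_(0) {
        assert(block_ > 0);
    }

    // Returns line `n`; throws std::out_of_range past the end of the file.
    const std::string& line(std::size_t n) {
        if (n < first_ || n >= first_ + buf_.size()) {
            load(window_for(n, block_));
        }
        return buf_.at(n - first_);
    }

private:
    void load(const Window& w) {
        in_.clear();  // drop any eof/fail state left by a previous read
        in_.seekg(index_.at(w.first / block_));  // jump to the indexed line
        buf_.clear();
        first_ = w.first;
        std::string s;
        for (std::size_t i = w.first; i < w.last && std::getline(in_, s); ++i) {
            buf_.push_back(s);
        }
    }

    std::istream& in_;
    std::vector<std::streamoff> index_;  // offset of every block_-th line
    std::size_t block_;
    std::size_t first_;                  // line number of buf_[0]
    std::vector<std::string> buf_;
};
```

Near the start of the file the window is simply clipped at line 0, so it holds two blocks instead of three; near the end it holds whatever lines remain.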

This works like magic, and blazingly fast. I'm experimenting with a 100 MB file now (~6 million lines). Reading and indexing it (I'm doing it in C++ now, and the smaller number of push_back calls on the vector gives gains) takes under 2 seconds! Afterwards, access to a line that is not in the buffer takes ~70 ms (a line that is in the buffer is returned immediately, of course).
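The indexing pass could look roughly like the sketch below. The function name build_sparse_index is hypothetical, and the sketch assumes '\n' line endings and a stream opened in binary mode, so the byte counts match real file offsets. It is an illustration, not the code I timed.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this in memory
#include <string>
#include <vector>

// Hypothetical helper: scans the stream once and records the byte offset of
// every `block`-th line, i.e. lines 0, block, 2*block, ...
std::vector<std::streamoff> build_sparse_index(std::istream& in,
                                               std::size_t block) {
    assert(block > 0);
    std::vector<std::streamoff> index;
    std::string line;
    std::size_t line_no = 0;
    std::streamoff offset = 0;  // byte offset of the line about to be read
    while (std::getline(in, line)) {
        if (line_no % block == 0) {
            index.push_back(offset);  // one push_back per block, not per line
        }
        // getline strips the '\n', so add it back when advancing the offset.
        offset += static_cast<std::streamoff>(line.size()) + 1;
        ++line_no;
    }
    return index;
}
```

With a block of 1000, a 6-million-line file yields about 6000 push_back calls instead of 6 million, which is where the gain mentioned above comes from.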

Memory consumption: the index table takes 4/BUFFER_BLOCK bytes per line, i.e. one 4-byte entry per BUFFER_BLOCK lines. For the gigantic file I'm testing, that is about 6000 entries, or only 24 KB.

The buffer itself is 3000 lines at 30 chars / line on average, only 90 KB or so.

So, the class "mirrors" a 100 MB file, taking only about 120 KB of memory and working blazingly fast.
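The arithmetic behind these figures can be checked with a few lines of C++. The function names are made up for this sketch, and the 4-byte entry size and 30-byte average line are the numbers quoted above, not measurements taken by the code.

```cpp
#include <cassert>
#include <cstddef>

// Bytes used by a sparse index holding one entry per `block` lines.
std::size_t index_bytes(std::size_t total_lines, std::size_t block,
                        std::size_t bytes_per_entry) {
    assert(block > 0);
    std::size_t entries = (total_lines + block - 1) / block;  // round up
    return entries * bytes_per_entry;
}

// Approximate bytes of line text held in the buffer (ignores the per-string
// bookkeeping overhead of the container).
std::size_t buffer_bytes(std::size_t buffered_lines,
                         std::size_t avg_line_bytes) {
    return buffered_lines * avg_line_bytes;
}
```

For 6,000,000 lines with a block of 1000 and 4-byte entries, index_bytes gives 24,000 bytes; buffer_bytes(3000, 30) gives 90,000 bytes. The sum is 114,000 bytes, in line with the "about 120 KB" estimate.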

Thanks for all the good and interesting answers, monks. I wonder, though, if Perl can match C++'s speed here. Indexing a 100 MB file in 1.7 seconds is quite impressive.

In reply to Re: Displaying/buffering huge text files by spurperl
in thread Displaying/buffering huge text files by spurperl
