Further on buffering huge text files
by spurperl (Priest) on Mar 09, 2005 at 08:47 UTC
spurperl has asked for the wisdom of the Perl Monks concerning the following question:
In Displaying/buffering huge text files I presented a need for a buffering module that would allow smooth, read-only display of huge text files in GUIs. A very interesting and lively discussion followed, and I settled on a solution: an internal buffer plus a decimated index that records one in every 1000 lines, which gave good performance with minimal memory consumption.
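For readers who missed the earlier thread, here is a minimal sketch of what I mean by a decimated index. The names `build_sparse_index` and `get_line` are made up for this post, lines are numbered from 0, and the demo uses a step of 4 on an in-memory file rather than 1000 on a real one. It is an illustration of the idea, not the actual module.

```perl
use strict;
use warnings;

# Record the byte offset of every $step-th line (hypothetical helper).
sub build_sparse_index {
    my ($fh, $step) = @_;
    my @offsets;
    my $line_no = 0;
    seek($fh, 0, 0);
    while (1) {
        my $pos  = tell($fh);
        my $line = <$fh>;
        last unless defined $line;
        push @offsets, $pos if $line_no % $step == 0;
        $line_no++;
    }
    return { offsets => \@offsets, step => $step, total => $line_no };
}

# Fetch line $n: seek to the nearest indexed line at or before it,
# then read forward at most $step - 1 extra lines.
sub get_line {
    my ($fh, $idx, $n) = @_;
    return if $n < 0 || $n >= $idx->{total};
    my $slot = int($n / $idx->{step});
    seek($fh, $idx->{offsets}[$slot], 0);
    my $line;
    for (0 .. $n - $slot * $idx->{step}) {
        $line = <$fh>;
    }
    return $line;
}

# Demo: 10 lines in an in-memory file, indexing every 4th line.
my $data = join '', map { "line $_\n" } 0 .. 9;
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $idx = build_sparse_index($fh, 4);
print get_line($fh, $idx, 6);
```

Memory use is one offset per `$step` lines, and a lookup never reads more than `$step` lines from disk.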
But real life being what it is, complications tend to spring up unexpectedly. An additional requirement for this module now raises some serious questions about the design.
The new requirement is, in essence, simple: there should be a way to filter certain lines out of a file, e.g. never show lines that start with "Foobar:".
At first this doesn't look tough, but given some thought it complicates matters enormously. The most annoying thing about such requirements is that they actually make sense (filtering is important on very big files).
I can assume that all filters are known beforehand. Say I know that a user might want to filter out "Foobar:" lines. At any point in the GUI the user may ask to enable or disable the filter.
I'm now thinking of making the filtering transparent to the GUI by handling it in the buffer. The GUI requests line 115 - if filtering is disabled, the buffer knows this is the real line 115 of the file and acts according to its original algorithm. But if filtering is enabled, it should return the 115th unfiltered line instead.
It probably means that on startup I need to create a separate index for each filter. That is not all, however, because the "real" distance between two adjacent unfiltered lines can be 5000 lines in the file. I wouldn't want to wade through them all just to find the next line.
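To make the per-filter index idea concrete, here is a rough sketch under my own assumptions: the filter is passed as a code reference `$is_hidden`, the index stores the byte offset of every `$step`-th visible line, and `build_filtered_index` and `get_visible_line` are hypothetical names. Lines are numbered from 0.

```perl
use strict;
use warnings;

# One pass over the file: remember the byte offset of every
# $step-th line that survives the filter (hypothetical helper).
sub build_filtered_index {
    my ($fh, $step, $is_hidden) = @_;
    my @offsets;
    my $visible = 0;
    seek($fh, 0, 0);
    while (1) {
        my $pos  = tell($fh);
        my $line = <$fh>;
        last unless defined $line;
        next if $is_hidden->($line);
        push @offsets, $pos if $visible % $step == 0;
        $visible++;
    }
    return { offsets => \@offsets, step => $step, total => $visible };
}

# Return the $n-th visible line: seek to the nearest indexed visible
# line at or before it, then read forward, skipping hidden lines.
sub get_visible_line {
    my ($fh, $idx, $n, $is_hidden) = @_;
    return if $n < 0 || $n >= $idx->{total};
    my $slot = int($n / $idx->{step});
    seek($fh, $idx->{offsets}[$slot], 0);
    my $remaining = $n - $slot * $idx->{step};
    while (defined(my $line = <$fh>)) {
        next if $is_hidden->($line);
        return $line if $remaining-- == 0;
    }
    return;
}

# Demo: 12 lines, only every third one is visible; index every 2nd visible line.
my $data = join '', map { $_ % 3 ? "Foobar: noise $_\n" : "keep $_\n" } 0 .. 11;
open(my $fh, '<', \$data) or die "cannot open in-memory file: $!";
my $hide = sub { $_[0] =~ /^Foobar:/ };
my $idx  = build_filtered_index($fh, 2, $hide);
print get_visible_line($fh, $idx, 3, $hide);
```

This keeps memory at one offset per `$step` visible lines, but it does not bound the number of raw lines read between two index points: if 5000 hidden lines sit between two visible ones, a lookup still has to skip over them.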
In addition, indexing filtered lines on startup imposes a severe performance hit. Instead of simply reading in each line and counting it, I now also have to apply a regular expression to each one.
Any ideas? I guess I can make it fast by sacrificing a lot of space, but that is not really an option for me.