http://qs321.pair.com?node_id=437827

spurperl has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed monks,

In Displaying/buffering huge text files I presented the need for a buffering module that would allow smooth, read-only display of huge text files in GUIs. A very interesting and lively discussion ensued, and I settled on a solution: an internal buffer plus a decimated index of every 1000th line, which gave good performance with minimal memory consumption.

But, as tends to happen in real life, complications spring up unexpectedly. An additional requirement for this module now raises some serious design questions.

The new requirement is, in essence, simple: there should be a way to filter certain lines out of a file, e.g. never show lines that start with "Foobar:".

At first this doesn't look tough, but given some thought it complicates matters enormously. The most annoying thing about such requirements is that they actually make sense (filtering is important on very big files).

I can assume to have all the filters beforehand; say, I know that a user might want to filter out "Foobar:" lines. At any point, the user may ask the GUI to enable or disable the filter.

I'm now thinking of making the filtering transparent to the GUI, inside the buffer. The GUI requests line 115; if no filter is active, the buffer knows this is the real line 115 of the file and acts according to its original algorithm. But if filtering is enabled, the buffer should provide the 115th unfiltered line.

This probably means that, on startup, I need to create a separate index for each filter. And not only that: the "real" distance between two adjacent unfiltered lines can be 5000 lines of the file, and I wouldn't want to wade through them all just to find the next unfiltered line.

In addition, indexing the filtered lines on startup imposes a severe performance hit. Instead of simply reading in each line and counting them, I now also have to apply a regular expression to each one.
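For concreteness, here is roughly the startup pass I have in mind, with the decimated index built for the filtered view in the same loop that counts lines (file name, step size, and filter are illustrative):

    use strict;
    use warnings;

    my $STEP   = 1000;            # index every 1000th visible line
    my $filter = qr/^Foobar:/;    # the known filter

    open my $fh, '<', 'huge.log' or die "open: $!";

    my @offsets;                  # $offsets[$i] = byte offset of visible line $i * $STEP
    my $visible = 0;
    my $pos     = tell $fh;       # offset of the line about to be read
    while (my $line = <$fh>) {
        unless ($line =~ $filter) {
            push @offsets, $pos if $visible % $STEP == 0;
            $visible++;
        }
        $pos = tell $fh;
    }

To fetch unfiltered line N later, seek to $offsets[int(N / $STEP)] and read forward, re-applying the filter, for at most 1000 visible lines.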

Any ideas? I guess I could make it fast by sacrificing a lot of space, but that is not really an option for me.

Re: Further on buffering huge text files
by dragonchild (Archbishop) on Mar 09, 2005 at 13:49 UTC
    Let me understand this - you want to have the ability to turn a filter off, thus recalculating the display. So, if the current filter had removed actual lines 1-100 and you are looking at filtered line 115 (real line 215) and you turn the filter off, you need to go back 100 lines and redisplay real line 115? I don't think that's what you want.

    The more intuitive behavior, to me, would be this: when a filter is turned off, the previously filtered lines simply become available as the user scrolls to where they are. The only change to the GUI is that you display filtered lines that now appear, keeping the center line still in the center. The way to do this is to have two buffers. This is going to be complicated, so bear with me.

    Let's say the GUI displays 21 lines - the line requested and 10 lines on either side. Your user has filters A and B turned on, filter C turned off, and asks to see line 115. So you have to display lines 104-125, counting only unfiltered lines. You pull in a line; $real_count increments. You try each filter in turn, even the ones that are turned off, and mark which filters would hide this line. You also store the tell() for this line. (This will be used later.) If no active filter hides the line, you increment $display_count. Keep doing this until you reach display-able line 104. From here until display-able line 125, you keep doing the same thing, but you also add the actual line to the buffer of lines. A sketch follows below.
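    In Perl, that scan might look roughly like this; the filter table and variable names are my own invention for illustration:

        use strict;
        use warnings;

        my %filters = (
            A => { re => qr/^Foobar:/, active => 1 },
            B => { re => qr/^Baz:/,    active => 1 },
            C => { re => qr/^Quux:/,   active => 0 },   # known, but turned off
        );

        open my $fh, '<', 'huge.log' or die "open: $!";

        my (@line_info, @line_text, $buffer_start);
        my ($real_count, $display_count) = (0, 0);
        my $pos = tell $fh;                  # offset of the line about to be read
        while (my $line = <$fh>) {
            $real_count++;
            my @hidden_by = grep { $line =~ $filters{$_}{re} } sort keys %filters;
            push @line_info, { real => $real_count, tell => $pos, hidden_by => \@hidden_by };

            $display_count++ unless grep { $filters{$_}{active} } @hidden_by;

            if ($display_count >= 104) {     # inside the window: keep the text too
                $buffer_start = $real_count unless defined $buffer_start;
                push @line_text, $line;
            }
            last if $display_count == 125;   # window filled
            $pos = tell $fh;
        }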

    Now you have two arrays in memory. The first is a list of records, one per actual line from actual line #1 through display-able line #125, recording the actual line number, which filters would remove it, and the tell() for the beginning of the line. The second is a buffer of the lines themselves, covering the window from display-able line #104 through #125.

    To display, you go backwards through the first array, pulling out the last 21 actual line numbers that pass the active filters. You then index into the second array - subtracting the actual line number where the buffer starts - and get the actual data lines. Put those on the display.
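    Continuing the sketch above (same invented variables), the display step might be:

        # Walk backwards through the index, collecting 21 lines that pass
        # the currently active filters; assumes the buffered window still
        # covers at least 21 passing lines.
        my @show;
        for (my $i = $#line_info; $i >= 0 && @show < 21; $i--) {
            next if grep { $filters{$_}{active} } @{ $line_info[$i]{hidden_by} };
            my $actual = $line_info[$i]{real};          # 1-based actual line number
            unshift @show, $line_text[ $actual - $buffer_start ];
        }
        print @show;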

    Now, when a filter is turned on or off, you just repeat the "To display" part - you already have all the data you need.

    If scrolling happens, you have to determine whether it's upwards or downwards. Upwards, you just grab the data starting at the top of the buffer (using the tell() data you stored before) and read it in from the file; you don't need to re-test the lines against the filters because you already know the results. Downwards, you have to build up the information about each new line as you read it.

    Now, you can use some sort of tied-hash-to-DBM solution to keep yourself from running out of RAM, at a small-to-medium cost in performance. Also, you can pre-process this information if the files and filters are relatively static, thereby incurring only the cost of the tied hash.
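    For instance, with the DB_File module that is bundled with most perls (file name and value layout below are mine, purely illustrative):

        use DB_File;
        use Fcntl qw(O_RDWR O_CREAT);

        my %index;
        tie %index, 'DB_File', 'line_index.db', O_RDWR | O_CREAT, 0666, $DB_HASH
            or die "tie: $!";

        # store: byte offset of actual line 115, plus the filters that would hide it
        $index{115} = join ',', 1_234_567, 'A', 'C';

        # fetch later, possibly in a different run of the program
        my ($offset, @hidden_by) = split /,/, $index{115};

        untie %index;

    If the keys are line numbers, $DB_BTREE (or a tied array via DB_RECNO) keeps them in order, which makes range scans cheaper.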

    As for adding filters at runtime . . . users are just going to have to accept that adding a new filter is expensive - there's no way around it. But this algorithm should still give you a substantial performance boost.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Re: Further on buffering huge text files
by Jaap (Curate) on Mar 09, 2005 at 10:21 UTC
    Why not leave the line numbers as they were before filtering and simply not display the filtered lines? It would look something like:
        1211 dfgsdfsdfg
        1212 sdfgdfgsdf
        1213 sdfgsdfgsdfg
        1273 sdfgsdfgsdfg
        1274 sdhgsfdg
    The only problem with this approach is when the filter would filter out most (or all) of the lines in the buffer. The display would be rather empty then.
      You are absolutely right about the "only problem" you present. If you think about it, it's an unacceptable problem, especially when most of the file is filtered out.
        Then keep reading until the buffer is filled? The GUI would request, say, 120 lines starting from line 35465 (or byte/char 235435); you would start reading lines from that position and keep reading and filtering until you can return the 120 lines.
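        A minimal sketch of that, with the offset, page size, and filter taken straight from this example (the sub name is mine):

            use strict;
            use warnings;

            # Seek to a known byte offset, then keep reading and filtering
            # until enough visible lines are collected.
            sub fetch_visible {
                my ($fh, $start_offset, $want) = @_;
                seek $fh, $start_offset, 0 or die "seek: $!";
                my @out;
                while (@out < $want) {
                    my $line = <$fh>;
                    last unless defined $line;           # end of file
                    push @out, $line unless $line =~ /^Foobar:/;
                }
                return @out;                             # may come up short at EOF
            }

            open my $fh, '<', 'huge.log' or die "open: $!";
            my @page = fetch_visible($fh, 235435, 120);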
Re: Further on buffering huge text files
by Anonymous Monk on Mar 09, 2005 at 10:16 UTC
    You say you can assume all filters are known beforehand - does that also apply to the file? If not, you still can't avoid reading in all the lines and applying a regexp to each.

    This problem seems to shout "database, database, database". But that's probably not going to help you much if the files are very volatile.

      Yes, the filters are known, but the file isn't. The GUI widget should in theory open any huge file, and it should display it, with a scroll bar, as fast as possible.

      Databases won't cut it here. It's a pre-written text file generated by an outside tool, and the GUI should be able to read any such file. Besides, the file grows over time and the GUI should keep up.

        Can you cache information? If files only grow with time, and your filters are constant, you could record for each line written to the file (or better, for each line read from the file) which filter(s) it matches. Using a bit-vector per filter, with a single bit per line, you need only 125K of storage per filter for a million-line file.
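        Something like this, using Perl's vec() (filter names and patterns invented for illustration):

            use strict;
            use warnings;

            # One bit per line per filter: a million-line file costs
            # 1_000_000 / 8 = 125_000 bytes per filter.
            my %filter_re = ( A => qr/^Foobar:/, B => qr/^Baz:/ );
            my %bits      = map { $_ => '' } keys %filter_re;

            open my $fh, '<', 'huge.log' or die "open: $!";
            my $n = 0;
            while (my $line = <$fh>) {
                for my $name (keys %filter_re) {
                    vec($bits{$name}, $n, 1) = $line =~ $filter_re{$name} ? 1 : 0;
                }
                $n++;
            }

            # Later, without re-reading the file: is line $i hidden by filter A?
            my $i = 42;
            my $hidden = vec($bits{A}, $i, 1);

        Since the file only grows, you can remember $n and tell($fh) and resume the loop later to extend the vectors incrementally.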
Re: Further on buffering huge text files
by halley (Prior) on Mar 09, 2005 at 17:07 UTC
    This looks to be exactly the same kind of thing that emacs calls "selective display". In that editor mode, you can hide any line that is indented more than N spaces. Groups of one or more hidden lines appear in the editor as a ... mark appended to the previous visible line. Your cursor moves right over the ellipses; you can select everything including an ellipsis, and you can delete or copy chunks that include ellipses.

    The principle can be extended to any sort of do-or-don't-show filter, of course.

    • You will need to read through the whole file once.
    • You will need to index line positions - either all lines, or only the lines that survive the filter.
    • If you're threaded, you can index in a separate thread.
      Fill the display once the indexer has indexed enough to fill the first page, but you may need the display to "wait" for the indexer sometimes if you let the user move around before the indexer has finished.
    • You can save the index to a file if the filtration is too slow to perform often.
    • You can "index the index" for intensely huge sources, where even the index would outstrip memory available.
    • When the user wants to scroll into areas not on the screen, use your index to re-read the applicable portions.
    • When the user wants to see things that are hidden, use your index to re-read the applicable portions.

    Think of your index as a hash of weak references to strings, if you know what that means. If memory is tight, don't keep the strings around; but if they're needed again, don't hesitate to reconstruct them.
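    Perl's Scalar::Util::weaken only works on references, but for plain strings a drop-and-rebuild cache captures the same spirit. A sketch (all names mine):

        my (%offset_of, %cache);    # line number => byte offset / cached text

        sub line_text {
            my ($fh, $lineno) = @_;
            return $cache{$lineno} if exists $cache{$lineno};
            %cache = () if keys %cache > 100_000;   # crude memory pressure valve
            seek $fh, $offset_of{$lineno}, 0 or die "seek: $!";
            my $line = <$fh>;
            return $cache{$lineno} = $line;
        }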

    --
    [ e d @ h a l l e y . c c ]

Re: Further on buffering huge text files
by 5mi11er (Deacon) on Mar 09, 2005 at 17:21 UTC
    I agree with dragonchild's post above. I just wanted to say specifically that I don't agree with your contention that, when asked for line 115, you should have to figure out which is the 115th line that happens not to be filtered out. You should display the non-filtered lines around actual line 115; whether line 115 itself is displayed is taken care of by dragonchild's display idea.

    -Scott