Re: Further on buffering huge text files
by dragonchild (Archbishop) on Mar 09, 2005 at 13:49 UTC
Let me understand this - you want to have the ability to turn a filter off, thus recalculating the display. So, if the current filter had removed actual lines 1-100 and you are looking at filtered line 115 (real line 215) and you turn the filter off, you need to go back 100 lines and redisplay real line 115? I don't think that's what you want.
The more intuitive thing to me would be for the filtered lines, when a filter is turned off, to become visible as the user scrolls to where the would-have-been-filtered lines are. The only change to the GUI is that you display the filtered lines that now appear, keeping the center line still in the center. The way to do this is to have two buffers. This is going to be complicated, so bear with me.
Let's say the GUI displays 21 lines - the line requested and 10 lines on either side. Your user has filters A and B turned on, filter C turned off, and asks to see line 115. So, you have to display lines 105-125, but showing only unfiltered lines. So, you pull in a line. $real_count increments. You try each filter in turn, even the ones that are turned off. You then mark which filters would hide this line. You also store the tell() for this line. (This will be used later.) Based on whether any active filter would hide the line, you increment $display_count. Keep doing this until you reach actual line 105. From here until display-able line 125, you keep doing the same thing, but you also add the actual line to the buffer of lines.
Now, you're going to have two arrays in memory: the first is a list of records - the actual line number, which filters would remove it, and the tell() for the beginning of the line - starting from actual line #1 and going to display-able line #125. You also have a buffer of lines from the file, from actual line #105 going to display-able line #125.
To display, you go backwards through the first array, pulling out the first 21 actual line#'s that pass the filters. You then go into the second array, subtracting 105 from the indices, and get the actual data line. Put those in the buffer.
Now, when a filter is turned on or off, you just repeat the "To display" part - you already have all the data you need.
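The indexing pass described above might look like this minimal sketch (the filter definitions and names are made up, and the real thing would only buffer the 21-line window rather than every line):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical filters: each 'hide' callback returns true if it would
# HIDE the line; 'active' says whether the user has it turned on.
my @filters = (
    { active => 1, hide => sub { $_[0] =~ /^#/    } },  # filter A
    { active => 1, hide => sub { $_[0] =~ /^\s*$/ } },  # filter B
    { active => 0, hide => sub { $_[0] =~ /DEBUG/ } },  # filter C (off)
);

# For every line, record its tell() offset and which filters (active
# or not) would hide it, until we've seen $want display-able lines.
sub build_index {
    my ($fh, $want) = @_;
    my (@index, @buffer);
    my $display_count = 0;
    while ($display_count < $want) {
        my $offset = tell $fh;
        my $line   = <$fh>;
        last unless defined $line;
        my @hides = grep { $filters[$_]{hide}->($line) } 0 .. $#filters;
        push @index,  { offset => $offset, hides => \@hides };
        push @buffer, $line;
        my $hidden = grep { $filters[$_]{active} } @hides;
        $display_count++ unless $hidden;
    }
    return (\@index, \@buffer);
}

# The "To display" step: keep the indices of lines that no currently
# active filter hides. Toggling a filter just means re-running this.
sub visible_indices {
    my ($index) = @_;
    my @vis;
    for my $i (0 .. $#$index) {
        my $hidden = grep { $filters[$_]{active} } @{ $index->[$i]{hides} };
        push @vis, $i unless $hidden;
    }
    return @vis;
}
```

Because the index records what every filter would do to every line, flipping a filter on or off only reruns the cheap selection step, never the file read.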
If scrolling happens, you have to determine if they're scrolling upwards or downwards. If they're going upwards, you just have to grab the data starting at the top of the buffer (using the tell() data you stored before) and read the data in from the file. You don't need to determine if the lines are buffered or not, because you already know this. If scrolling happens downwards, you're going to have to build up the information about each line as you go.
Now, you can use some sort of tied-hash-to-DBM type of solution to keep yourself from running out of RAM, but at a small-to-medium cost to performance. Also, you can pre-process this information if the files and filters are relatively static, thereby incurring only the cost of the tied hash.
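A sketch of the tied-hash idea, using SDBM_File since it ships with Perl (the key/value layout here is made up - any serialization of the offset and filter mask would do):

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;    # any DBM module works; this one is in core

# Keep the per-line index on disk instead of in RAM: the key is the
# actual line number, the value holds the tell() offset and a bit-mask
# of the filters that would hide the line.
tie my %line_index, 'SDBM_File', 'line_index', O_RDWR | O_CREAT, 0666
    or die "tie: $!";

$line_index{1} = join ':', 0,  2;   # offset 0,  hidden by filter B
$line_index{2} = join ':', 57, 0;   # offset 57, hidden by nothing

my ($offset, $mask) = split /:/, $line_index{2};
untie %line_index;
```

Every hash access becomes a DBM lookup, which is the "small-to-medium cost to performance" - but the index no longer has to fit in memory.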
As for adding filters at runtime . . . users are just going to have to know that adding a new filter is going to be expensive - there's just no way around it. But, this algorithm should provide a lot of performance boosts.
Being right, does not endow the right to be rude; politeness costs nothing. Being unknowing, is not the same as being stupid. Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence. Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.
Re: Further on buffering huge text files
by Jaap (Curate) on Mar 09, 2005 at 10:21 UTC
Why not leave the line numbers as they were before filtering and simply not display the filtered lines? It would look something like:
1211 dfgsdfsdfg
1212 sdfgdfgsdf
1213 sdfgsdfgsdfg
1273 sdfgsdfgsdfg
1274 sdhgsfdg
The only problem with this approach is when the filter would filter out most (or all) of the lines in the buffer. The display would be rather empty then.
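For what it's worth, that display is trivial to produce - a sketch (the hide callback is hypothetical):

```perl
use strict;
use warnings;

# Print lines with their original numbers, simply omitting the
# filtered ones, as in the sample output above.
sub display_filtered {
    my ($lines, $hide) = @_;
    my @out;
    for my $i (0 .. $#$lines) {
        next if $hide->($lines->[$i]);
        push @out, sprintf "%d %s", $i + 1, $lines->[$i];
    }
    return @out;
}
```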
You are absolutely right about the "only problem" you present. If you think about it, it's an unacceptable problem, especially when most of the file is filtered out.
Then keep reading until the buffer is filled? The GUI would request, say, 120 lines starting from line 35465 (or byte/char 235435); you would start reading lines from that position and keep reading and filtering until you can return the 120 lines.
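That read-until-filled loop might look like this sketch (the filter callback is hypothetical):

```perl
use strict;
use warnings;

# Return up to $want display-able lines starting at byte $offset,
# reading and filtering until the buffer fills or the file ends.
sub fill_buffer {
    my ($fh, $offset, $want, $hide) = @_;
    seek $fh, $offset, 0 or die "seek: $!";
    my @out;
    while (@out < $want and defined(my $line = <$fh>)) {
        push @out, $line unless $hide->($line);
    }
    return @out;
}
```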
Re: Further on buffering huge text files
by Anonymous Monk on Mar 09, 2005 at 10:16 UTC
You say you can assume all filters to be known beforehand - does that also apply to the file? If not, you still can't avoid reading in all the lines and applying a regexp to each.
This problem seems to shout "database, database, database". But that's probably not going to help you much if the files are very volatile.
Can you cache information? If files only grow with time, and your filters are constant, you could record for each line written to the file (or better, for each line read from the file) which filter(s) it matches. Using a bit-vector for each filter, and a single bit per line, you only need 125k of storage per filter for a million line file.
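Perl's built-in vec() is a natural fit for such bit-vectors - a sketch (the filter names are made up):

```perl
use strict;
use warnings;

# One bit per line per filter: a million-line file costs only about
# 1_000_000 / 8 = 125_000 bytes of storage per filter.
my %match_bits;    # filter name => packed bit string

sub record_match {
    my ($filter, $lineno, $matched) = @_;
    $match_bits{$filter} //= '';
    vec($match_bits{$filter}, $lineno, 1) = $matched ? 1 : 0;
}

sub line_matches {
    my ($filter, $lineno) = @_;
    return vec($match_bits{$filter} // '', $lineno, 1);
}
```

The bit strings themselves can be written straight to a cache file and reloaded, which is what makes the "record as the file grows" scheme cheap.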
Re: Further on buffering huge text files
by halley (Prior) on Mar 09, 2005 at 17:07 UTC
This looks to be exactly the same kind of thing that emacs calls "selective display." In that editor mode, you can hide any line that is indented more than N spaces. Groups of one or more hidden lines appear in the editor as a ... mark appended to the previous visible line. Your cursor moves right over the ellipses, you can select everything including an ellipsis, you can delete or copy chunks that include ellipses, etc.
The principle can be extended to any sort of do-or-don't-show filter, of course.
- You will need to read through the whole file once.
- You will need to index line positions, either indexing all lines, or filtered lines.
- If you're threaded, you can index in a separate thread. Fill the display once the indexer has indexed enough to fill the first page, but you may need the display to "wait" for the indexer sometimes if you let the user move around before the indexer has finished.
- You can save the index to a file if the filtration is too slow to perform often.
- You can "index the index" for intensely huge sources, where even the index would outstrip memory available.
- When the user wants to scroll into areas not on the screen, use your index to re-read the applicable portions.
- When the user wants to see things that are hidden, use your index to re-read the applicable portions.
Think of your index as being a hash of weak references to strings, if you know what that means. If the memory is tight, don't keep the strings around, but if they're needed again, then don't hesitate to reconstruct them.
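One way to read that advice in Perl: always keep the cheap per-line offsets, treat the line strings themselves as a cache you may throw away at any time, and rebuild any string from its offset on demand. A sketch (the names are made up):

```perl
use strict;
use warnings;

{
    my %cache;     # line number => string (droppable at any time)
    my %offset;    # line number => tell() position (always kept)

    sub remember_offset { my ($n, $pos) = @_; $offset{$n} = $pos }
    sub drop_cache      { %cache = () }    # call when memory is tight

    # Return the line, reconstructing it from the file if the cached
    # string has been dropped.
    sub get_line {
        my ($fh, $n) = @_;
        return $cache{$n} if exists $cache{$n};
        seek $fh, $offset{$n}, 0 or die "seek: $!";
        my $line = <$fh>;
        return $cache{$n} = $line;
    }
}
```

(True weak references via Scalar::Util::weaken only apply to references, so for plain strings an explicitly droppable cache like this captures the same "don't hesitate to reconstruct" idea.)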
-- [ e d @ h a l l e y . c c ]
Re: Further on buffering huge text files
by 5mi11er (Deacon) on Mar 09, 2005 at 17:21 UTC
I agree with dragonchild's post above. I just wanted to say specifically that I don't agree with your contention that, when asked for line 115, you should have to figure out which is the 115th line that happens not to be filtered out. You should display the non-filtered lines around actual line 115. Whether 115 is actually displayed should be taken care of by dragonchild's display idea.
-Scott