
Nice, looks interesting and eminently useful. Of course the "any time the user suspects" part is a bit weak.

it's impossible to find the Nth occurrence of some phrase or word in a book without opening the book and counting your way through it.

I would like to note that back in school, with practice, I became able to consistently open a thick Japanese character dictionary (Nelson's) to the correct page. I think this works like a lookup table that maps the thickness of the pages under one's thumb to a list of >100 chapters. At least it always worked for the most important chapter.

So if you know where a change has been made in a file, you could in fact jump to a prestudied location before that point, one that is known to have X occurrences of a pattern before it, and then count the remaining N-X occurrences starting from there. Metadata describing the various prestudied points (or the results of prerun pattern matches) could be saved in a memo at the head of the index file.
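To make the "prestudied location" idea concrete, here is a rough sketch in Python (not Perl, and not how File::Index works as far as I know). The names `build_checkpoints` and `find_nth` are made up for illustration, and it assumes the pattern never matches the empty string. Each checkpoint stores an offset together with the number of matches that end at or before it, so a lookup only has to count from the nearest checkpoint. After an edit, only checkpoints before the changed offset remain valid.

```python
import bisect
import re

def build_checkpoints(text, pattern, every=100):
    """Return a list of (offset, matches_seen) pairs, one per `every` matches.

    Each offset is the end of a match, so a scan resumed from it sees
    exactly the matches a scan from the start would see next.
    """
    checkpoints = [(0, 0)]
    for count, m in enumerate(re.finditer(pattern, text), start=1):
        if count % every == 0:
            checkpoints.append((m.end(), count))
    return checkpoints

def find_nth(text, pattern, n, checkpoints):
    """Return the start offset of the n-th (1-based) match, or None."""
    counts = [seen for _, seen in checkpoints]
    # Last checkpoint that has seen fewer than n matches.
    i = bisect.bisect_left(counts, n) - 1
    offset, seen = checkpoints[i]
    for m in re.compile(pattern).finditer(text, offset):
        seen += 1
        if seen == n:
            return m.start()
    return None
```

With a checkpoint every 100 matches, finding the 250th occurrence scans only the 50 matches after the checkpoint for match 200 instead of all 250.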

You could also save a series of checksums per chapter (if not per line), which could help determine where a change was made, though maybe Diff could do something similar. This would let you enjoy the benefits of a flat file, i.e. do regex pattern matching or tie the file to some module's object model like Config, while also enjoying some of the structure given by a record-based object store.
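A minimal sketch of the checksum idea, again in Python and purely illustrative: `block_checksums` and `first_changed_block` are hypothetical names, and fixed-size blocks of lines stand in for chapters. Comparing the stored checksums with freshly computed ones gives the first block that differs, so everything indexed before that block can be trusted.

```python
import hashlib

def block_checksums(lines, block_size=50):
    """Return one SHA-1 hex digest per block of `block_size` lines."""
    sums = []
    for i in range(0, len(lines), block_size):
        block = "".join(lines[i:i + block_size])
        sums.append(hashlib.sha1(block.encode("utf-8")).hexdigest())
    return sums

def first_changed_block(old_sums, new_sums):
    """Return the index of the first differing block, or None if identical."""
    for i, (old, new) in enumerate(zip(old_sums, new_sums)):
        if old != new:
            return i
    if len(old_sums) != len(new_sums):
        # One file is a strict extension of the other.
        return min(len(old_sums), len(new_sums))
    return None
```

Note that inserting or deleting a line shifts every later block, so this only tells you where the first change is, not how many changes there are.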

Personally I would probably rather have an index that operates on keywords or patterns than one that uses a recno. If the text file has a list of paragraphs, I could save a few words describing each paragraph in the index and then later jump to the Nth article matching a given keyword or scoring above a certain threshold. Or perhaps I have a list of events in a calendar, each with an event type or event owner associated with it. In this case maybe I would like to have multiple lines per record, in other words the delimiter would not be "\n". Maybe I'd like a (not necessarily unique) date-based key, or a serial number in a certain format. These are just ideas.
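Something like the following Python sketch is what I have in mind. It is only an illustration: `build_keyword_index`, `nth_record`, and the `extract` callback are invented names, records are assumed to be separated by a blank line, and offsets are counted in characters. The callback decides which keys a record is filed under, so the same code could index by keyword, event owner, or date.

```python
from collections import defaultdict

def build_keyword_index(text, delimiter="\n\n", extract=None):
    """Map each key to a list of (offset, length) pairs for matching records.

    `extract` is a callback that takes one record and returns its keys.
    By default every lowercased word in the record is a key.
    """
    if extract is None:
        extract = lambda record: set(record.lower().split())
    index = defaultdict(list)
    offset = 0
    for record in text.split(delimiter):
        for key in extract(record):
            index[key].append((offset, len(record)))
        offset += len(record) + len(delimiter)
    return index

def nth_record(text, index, key, n):
    """Return the n-th (1-based) record filed under `key`, or None."""
    hits = index.get(key, [])
    if not 1 <= n <= len(hits):
        return None
    start, length = hits[n - 1]
    return text[start:start + length]
```

Because keys need not be unique, a key such as an owner name or a date simply collects every record filed under it, in file order.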

I am trying to think of when I would want to use your new module, and I keep coming back to extracting descriptive words from text, as in NLP (natural language processing), and saving them with each paragraph or sentence. That would be useful whether or not the data is a single flat file, and so would a tool for navigating the precompiled index, with its pointers into the data. Perhaps a callback or plugin mechanism for index creation would help.

At the moment I am thinking of indexing books, which make nice flat files. I wrote a little program that lets me read books from my server on my cell phone when on the train (turns out that's not cheap, but anyway). I read 10KB per page, which is the most that fits in RAM and enough to reach the next station. It would be nice to have an index built so that each page ends at the end of a sentence, within the 10K limit. Doing this by hand is such a pain that currently I even split words across pages. A recno could be used as a bookmark, if the recnos are created based on a page length and a "try not to break sentences across pages" heuristic. So to make a long story short, it would be interesting if your module supported creating indices based on pages whose length is decided somewhat intelligently. Would that be possible with your module? Keep up the good work!
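For what it's worth, the paging heuristic I am describing could look roughly like this in Python. It is a sketch under my own assumptions: `page_offsets` is a made-up name, the limit is counted in characters rather than bytes, and a sentence is taken to end at ".", "!" or "?" (optionally followed by a closing quote or bracket) and then whitespace. If a window contains no sentence end it falls back to the last space, and only then to a hard cut.

```python
import re

# Sentence-ending punctuation, optional closing quote/bracket, then whitespace.
SENTENCE_END = re.compile(r'[.!?]["\')\]]*\s+')

def page_offsets(text, limit=10_000):
    """Return the start offset of each page; no page exceeds `limit` chars."""
    starts = [0]
    pos = 0
    while len(text) - pos > limit:
        window = text[pos:pos + limit]
        cut = None
        for m in SENTENCE_END.finditer(window):
            cut = m.end()          # keep the last sentence end in the window
        if cut is None:
            space = window.rfind(" ")
            cut = space + 1 if space > 0 else limit
        pos += cut
        starts.append(pos)
    return starts
```

The position of an offset in the returned list is then the page number, which is the recno-style bookmark: page k is `text[starts[k]:starts[k + 1]]`, with the last page running to the end of the text.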

In reply to Re: Proof of concept: File::Index by mattr
in thread Proof of concept: File::Index by davido
