Re: speeding up a file-based text search

How large would an inverted word list be? It might work better to try to match against a join(' ', @wordlist) (computed only once, obviously). Unless your data is highly random, for a document of 20MB that string probably won't exceed a few dozen kbytes (if it's even that large). Upon finding a match, pos, index, rindex, substr would serve to extract the full word the match landed on, which you can then look up in your inverted word list.
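To make that concrete, here is a minimal sketch of the joined-word-list idea. The `%inverted` hash, its contents, and the `words_containing` helper are made up for illustration, and it assumes the search fragment contains no whitespace:

```perl
use strict;
use warnings;

# Hypothetical inverted word list: word => list of record ids.
my %inverted = (
    cardiology => [1, 7],
    cardiogram => [2],
    myocardial => [3, 7],
    nephrology => [4],
);

# Join the distinct words once; a space at each end simplifies boundary handling.
my $haystack = ' ' . join(' ', sort keys %inverted) . ' ';

# Return every distinct word that contains $fragment as a substring.
# Assumes $fragment contains no whitespace.
sub words_containing {
    my ($fragment) = @_;
    my @words;
    my $from = 0;
    while ((my $hit = index($haystack, $fragment, $from)) >= 0) {
        my $start = rindex($haystack, ' ', $hit) + 1;  # just past the preceding space
        my $end   = index($haystack, ' ', $hit);       # the space that ends the word
        push @words, substr($haystack, $start, $end - $start);
        $from = $end;                                  # continue with the next word
    }
    return @words;
}

# Expand the matched words into record ids via the inverted list.
my @matched = words_containing('cardi');
my %ids     = map { $_ => 1 } map { @{ $inverted{$_} } } @matched;
print join(',', sort { $a <=> $b } keys %ids), "\n";
```

Only the short joined string is scanned here; the full document is never touched until you fetch the qualifying records.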

Regexp::PreSuf would be of use to increase the pattern efficiency if it's still an issue.

Makeshifts last the longest.

Re: Re: speeding up a file-based text search by perrin (Chancellor) on May 07, 2003 at 20:15 UTC
I might be able to do something with a word list. This is actually a medical glossary, so the number of distinct terms may be higher than common documents, but probably not too bad. The situation is complicated by the fact that I need to support phrase searching as well as and/or boolean searching. I could use a word list just to qualify records for further checking. I already have a dbm file for fast random access to the records once I know which ones I want. I'll try that and see how much of a difference it makes.
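A rough sketch of that qualify-then-check approach, under some assumptions: a plain hash stands in for the dbm file (a tied hash would be read the same way), the sample records are invented, and `candidates` and `phrase_search` are hypothetical names:

```perl
use strict;
use warnings;

# Stand-in for the dbm file: record id => text.
my %records = (
    1 => 'Acute renal failure is a sudden loss of kidney function.',
    2 => 'Heart failure may follow acute myocardial infarction.',
    3 => 'Renal colic is pain caused by kidney stones.',
);

# Build the inverted word list: lowercased word => { record id => 1 }.
my %inverted;
for my $id (keys %records) {
    $inverted{lc $_}{$id} = 1 for $records{$id} =~ /(\w+)/g;
}

# Ids of records that contain every one of the given words (case-insensitive).
sub candidates {
    my %seen;
    my @words = grep { !$seen{$_}++ } map { lc $_ } @_;
    my %count;
    for my $w (@words) {
        $count{$_}++ for keys %{ $inverted{$w} || {} };
    }
    return sort { $a <=> $b } grep { $count{$_} == @words } keys %count;
}

# Phrase search: qualify by individual words first, then confirm the
# phrase against the text of the few records that survive.
sub phrase_search {
    my ($phrase) = @_;
    my @words = $phrase =~ /(\w+)/g;
    return grep { $records{$_} =~ /\b\Q$phrase\E\b/i } candidates(@words);
}

print join(',', candidates(qw(acute failure))), "\n";
print join(',', phrase_search('acute renal failure')), "\n";
```

An "or" query would take the union of the per-word id sets instead of the intersection; the expensive regex check only runs on records the word list has already qualified.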
Re: Re: Re: speeding up a file-based text search by BrowserUk (Patriarch) on May 07, 2003 at 20:24 UTC
Could you show us a few example queries that you wish to support and, specifically, the format in which the queries are defined?

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: Re: Re: Re: speeding up a file-based text search by perrin (Chancellor) on May 07, 2003 at 20:42 UTC
Here are the options:
query type: phrase, and, or
case-sensitive: yes, no
whole words only: yes, no
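For illustration only, one way those three options could map onto regexes. `make_matcher` and its parameter names are hypothetical, not part of the real application, and the whole-word handling assumes each term begins and ends with a word character:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the options into a closure that tests one record.
#   type           => 'phrase' | 'and' | 'or'
#   query          => the search string
#   case_sensitive => 0 | 1
#   whole_words    => 0 | 1
sub make_matcher {
    my (%opt) = @_;

    # A phrase is one literal term; and/or queries are split on whitespace.
    my @terms = $opt{type} eq 'phrase' ? ($opt{query}) : split(' ', $opt{query});

    my @res = map {
        my $pat = quotemeta $_;
        $pat = "\\b$pat\\b" if $opt{whole_words};
        $opt{case_sensitive} ? qr/$pat/ : qr/$pat/i;
    } @terms;

    if ($opt{type} eq 'or') {
        # Any single term is enough.
        return sub {
            my ($text) = @_;
            for my $re (@res) { return 1 if $text =~ $re }
            return 0;
        };
    }

    # 'and' and 'phrase': every term must match.
    return sub {
        my ($text) = @_;
        for my $re (@res) { return 0 if $text !~ $re }
        return 1;
    };
}

my $text  = 'Acute renal failure is a sudden loss of kidney function.';
my $match = make_matcher(
    type => 'and', query => 'renal kidney', case_sensitive => 0, whole_words => 1,
);
print $match->($text) ? "match\n" : "no match\n";
```

The closure is built once per query, so the per-record cost is just running the precompiled patterns.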
Re: Re: Re: Re: Re: speeding up a file-based text search by BrowserUk (Patriarch) on May 07, 2003 at 21:56 UTC
Re: Re: Re: Re: Re: Re: speeding up a file-based text search by perrin (Chancellor) on May 07, 2003 at 22:06 UTC