Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re: Re: Re: Re: Re: speeding up a file-based text search

by BrowserUk (Pope)
on May 07, 2003 at 21:56 UTC ( #256391=note: print w/replies, xml ) Need Help??


in reply to Re: Re: Re: Re: speeding up a file-based text search
in thread speeding up a file-based text search

The reasons you have said that using an inverted index isn't practical is that

a) you need to support searching for phrases

Matching phrases against an index is a case of splitting the phrase into its constituant words, and then intersecting the sets of record numbers that are returned from the index. (see Re: Idea for XPath implementation for slightly better explaination of this).

And, or & not are just extensions of the set manipulations.

b) you need to support partial matches.

Partial matches are a bit more complex, but davorgs Tie::Hash::Regex as the basic for your inverted index,

or use grep /partial.*., keys %index; (which what is used under the covers).

This would probably involve using doing some manipulation of the input query to convert partial matches to regex notation (eg. bio* => bio[^\s]*), unless your users are comfortable using regex notation.

Just a thought in case you haven't already considered this.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Replies are listed 'Best First'.
Re: Re: Re: Re: Re: Re: speeding up a file-based text search
by perrin (Chancellor) on May 07, 2003 at 22:06 UTC
    Phrase matching is not the same as "and" matching. It's not enough for two words to both be in the same record; they have to be there next to each other in the correct order. A word list can't do that, although it can be used to qualify records for further checking. I can do partial matching as part of that, although it requires a full scan of the word list. I'm going to try it.

    Incidentally, I'm using index() instead of m// for partial matching, which should be faster. Giving users regex search capability is not a goal.

      One possibility for phrase matching is to build a second layer of indexing. The first layer is "This word exists". The second layer is "This other word is right after me at some place in the document".

      Now, this will give the possibility of false matches, depending on how you index. For example, the phrase "in the" might end up matching "Come on in. The tea is on the stove."

      Another problem is 3+ word phrases. The system I'm proposing will tell you if pairs are in the right order. But, using the above snippet, "in the stove" would match that document because "in the" and "the stove" are both phrases that exist, even though "in the stove" isn't there.

      But, it all depends on how perfect you want to be. "Good enough, I can give you now. Perfect will be along tomorrow."

      ------
      We are the carpenters and bricklayers of the Information Age.

      Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

      Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.

      Phrase matching is not the same as "and" matching.

      Agreed. Sorry if implied otherwise. But once you have a list of the records that contain all the terms, then validating these against the original phrase is considerably less costly than searching the whole 20MB.

      Good point. Using index on the keys of the hash does make more sense.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      Depends on your word list. You could store the in-record location(s) of the word as well; then, when doing a phrase search, you can intersect the sets for each word by record and then check for consecutive locations in the correct order. This is, AFAIK and at least roughly, the way all of the big web search engines work.

      Makeshifts last the longest.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://256391]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (9)
As of 2020-10-20 15:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (210 votes). Check out past polls.

    Notices?