http://qs321.pair.com?node_id=256375


in reply to Re: Re: speeding up a file-based text search
in thread speeding up a file-based text search

Could you show us a few example queries that you wish to support and, specifically, the format in which the queries are defined?


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Re: Re: Re: Re: speeding up a file-based text search
by perrin (Chancellor) on May 07, 2003 at 20:42 UTC
    Here are the options:

    • query type: phrase, and, or
    • case-sensitive: yes, no
    • whole words only: yes, no

      The reasons you have given for an inverted index not being practical are that

      a) you need to support searching for phrases

      Matching phrases against an index is a case of splitting the phrase into its constituent words, and then intersecting the sets of record numbers that are returned from the index. (See Re: Idea for XPath implementation for a slightly better explanation of this.)

      "And", "or" and "not" queries are just extensions of the same set manipulations.
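The set manipulations described above could look something like the following. This is only a minimal sketch: the sample records and the sub names (ids_for, search_and, search_or, search_not) are made up for illustration, and the index is a plain hash of word => { record number => 1 }.

```perl
use strict;
use warnings;

# Hypothetical sample data: record number => text.
my %records = (
    1 => 'the quick brown fox',
    2 => 'the lazy brown dog',
    3 => 'a quick red fox',
);

# Build the inverted index: word => { record number => 1 }.
my %index;
for my $id (keys %records) {
    $index{lc $_}{$id} = 1 for split ' ', $records{$id};
}

# Set of record numbers (as a hash ref) for one word; empty if unknown.
sub ids_for {
    my ($word) = @_;
    return $index{lc $word} || {};
}

# AND: intersect the sets, keeping records present for every word.
sub search_and {
    my @words = @_;
    return () unless @words;
    my %hits = %{ ids_for(shift @words) };
    for my $word (@words) {
        my $set = ids_for($word);
        delete @hits{ grep { !$set->{$_} } keys %hits };
    }
    return sort { $a <=> $b } keys %hits;
}

# OR: union of the sets.
sub search_or {
    my %hits;
    @hits{ keys %{ ids_for($_) } } = () for @_;
    return sort { $a <=> $b } keys %hits;
}

# NOT: records containing $want, minus those containing $unwanted.
sub search_not {
    my ($want, $unwanted) = @_;
    my $exclude = ids_for($unwanted);
    return sort { $a <=> $b }
           grep { !$exclude->{$_} } keys %{ ids_for($want) };
}

print join(',', search_and(qw(quick fox))), "\n";
```

Note that search_and only says both words occur somewhere in the record; it says nothing about their order or adjacency.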

      b) you need to support partial matches.

      Partial matches are a bit more complex, but you could use davorg's Tie::Hash::Regex as the basis for your inverted index,

      or use grep /partial.*/, keys %index; (which is what is used under the covers).

      This would probably involve doing some manipulation of the input query to convert partial matches to regex notation (e.g. bio* => bio[^\s]*), unless your users are comfortable using regex notation.
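As a rough sketch of that query manipulation, assuming "*" is the only wildcard users may type: everything else in the pattern is quoted with quotemeta, so no regex syntax leaks through. The sub names wildcard_to_regex and matching_words and the sample index are hypothetical.

```perl
use strict;
use warnings;

# Turn a user pattern such as "bio*" into an anchored, case-insensitive
# regex. Each "*" becomes [^\s]*; all other characters are taken literally.
sub wildcard_to_regex {
    my ($pattern) = @_;
    my $body = join '[^\s]*',
               map { quotemeta($_) } split /\*/, $pattern, -1;
    return qr/\A$body\z/i;
}

# Scan the keys of the inverted index for words matching the pattern.
sub matching_words {
    my ($pattern, $index) = @_;
    my $re    = wildcard_to_regex($pattern);
    my @found = grep { $_ =~ $re } keys %$index;
    return sort @found;
}

# Hypothetical index: word => { record number => 1 }.
my %index = (
    bio       => { 4 => 1 },
    biology   => { 1 => 1 },
    biometric => { 2 => 1 },
    symbiosis => { 3 => 1 },
);

print join(' ', matching_words('bio*', \%index)), "\n";
```

The record sets of the matching words can then be unioned before any further and/or/not processing. The scan still visits every key in the index.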

      Just a thought in case you haven't already considered this.


        Phrase matching is not the same as "and" matching. It's not enough for two words to both be in the same record; they have to be there next to each other in the correct order. A word list can't do that, although it can be used to qualify records for further checking. I can do partial matching as part of that, although it requires a full scan of the word list. I'm going to try it.
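A minimal sketch of that two-step approach, assuming records are plain space-separated text: the word list qualifies candidate records, then index() against the record text confirms that the words are adjacent and in order. The sample records and the sub name search_phrase are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data: record number => text.
my %records = (
    1 => 'the quick brown fox',
    2 => 'brown and quick',
    3 => 'quick brown bread',
);

# Word list: word => { record number => 1 }.
my %index;
for my $id (keys %records) {
    $index{$_}{$id} = 1 for split ' ', lc $records{$id};
}

sub search_phrase {
    my ($phrase) = @_;
    my @words = split ' ', lc $phrase;
    return () unless @words;

    # Step 1: qualify candidates, i.e. records containing every word.
    my @candidates = grep {
        my $id = $_;
        !grep { !$index{$_} || !$index{$_}{$id} } @words
    } keys %records;

    # Step 2: check the text itself for the words side by side, in order.
    # Padding with spaces keeps the match on whole words.
    my $needle = ' ' . join(' ', @words) . ' ';
    return sort { $a <=> $b }
           grep { index(' ' . lc($records{$_}) . ' ', $needle) >= 0 }
           @candidates;
}

print join(',', search_phrase('quick brown')), "\n";
```

Record 2 contains both words and so passes step 1, but it is rejected in step 2 because the words are not next to each other in that order.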

        Incidentally, I'm using index() instead of m// for partial matching, which should be faster. Giving users regex search capability is not a goal.
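For illustration, a partial match over the word list using index() might look like this. It is a sketch only: partial_matches and its $case_sensitive flag are hypothetical names, and the scan still visits every word.

```perl
use strict;
use warnings;

# Return the words that contain $fragment anywhere, using index()
# instead of a regex. Matching is case-insensitive unless the
# (hypothetical) $case_sensitive flag is true.
sub partial_matches {
    my ($fragment, $words, $case_sensitive) = @_;
    $fragment = lc $fragment unless $case_sensitive;
    return grep {
        index(($case_sensitive ? $_ : lc($_)), $fragment) >= 0
    } @$words;
}

my @words = qw(Biology symbiosis rabbit);
print join(' ', partial_matches('bio', \@words)), "\n";
```

Because index() treats the fragment as a literal string, there is no need to escape anything the user types.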