PerlMonks  

Re: Re: Re: speeding up a file-based text search

by BrowserUk (Patriarch)
on May 07, 2003 at 20:24 UTC


in reply to Re: Re: speeding up a file-based text search
in thread speeding up a file-based text search

Could you show us a few example queries that you wish to support, and specifically the format in which the queries are defined?


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

Replies are listed 'Best First'.
Re: Re: Re: Re: speeding up a file-based text search
by perrin (Chancellor) on May 07, 2003 at 20:42 UTC
    Here are the options:

    • query type: phrase, and, or
    • case-sensitive: yes, no
    • whole words only: yes, no

      The reasons you have given for an inverted index not being practical are that

      a) you need to support searching for phrases

      Matching phrases against an index is a case of splitting the phrase into its constituent words, and then intersecting the sets of record numbers that are returned from the index (see Re: Idea for XPath implementation for a slightly better explanation of this).

      And, or & not are just extensions of the set manipulations.
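The set manipulations described above can be sketched roughly as follows. This is a minimal illustration, not anyone's actual implementation: the sample records and the helper names (ids_for, set_and, set_or, set_not) are made up for the example, and each set of record numbers is held as a hash.

```perl
use strict;
use warnings;

# Hypothetical sample records, keyed by record number.
my %records = (
    1 => 'the quick brown fox',
    2 => 'the lazy brown dog',
    3 => 'a quick red fox',
);

# Build the inverted index: word => { record number => 1 }.
my %index;
for my $id ( keys %records ) {
    $index{ lc($_) }{$id} = 1 for split ' ', $records{$id};
}

# Record numbers containing a word, as a hash used as a set.
sub ids_for {
    my ($word) = @_;
    return $index{ lc($word) } || {};
}

# "and": records present in both sets.
sub set_and {
    my ( $x, $y ) = @_;
    my %out = map { $_ => 1 } grep { $y->{$_} } keys %$x;
    return \%out;
}

# "or": records present in either set.
sub set_or {
    my ( $x, $y ) = @_;
    my %out = ( %$x, %$y );
    return \%out;
}

# "not": records in the first set but not in the second.
sub set_not {
    my ( $x, $y ) = @_;
    my %out = map { $_ => 1 } grep { !$y->{$_} } keys %$x;
    return \%out;
}

my @both   = sort { $a <=> $b } keys %{ set_and( ids_for('quick'), ids_for('fox') ) };
my @either = sort { $a <=> $b } keys %{ set_or( ids_for('quick'), ids_for('dog') ) };
my @only   = sort { $a <=> $b } keys %{ set_not( ids_for('brown'), ids_for('dog') ) };

print "and: @both\n";      # and: 1 3
print "or:  @either\n";    # or:  1 2 3
print "not: @only\n";      # not: 1
```

Note that the "and" result only says both words occur somewhere in the record; it says nothing about their order or adjacency.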

      b) you need to support partial matches.

      Partial matches are a bit more complex, but you could use davorg's Tie::Hash::Regex as the basis for your inverted index, or use grep /partial.*/, keys %index; (which is what it uses under the covers).

      This would probably involve some manipulation of the input query to convert partial matches to regex notation (e.g. bio* => bio[^\s]*), unless your users are comfortable using regex notation.
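As a rough sketch of that conversion, assuming "*" is the only wildcard users can type: the index contents and the names wildcard_to_regex and partial_lookup are hypothetical, and the lookup greps over every key of a plain hash rather than using Tie::Hash::Regex.

```perl
use strict;
use warnings;

# Hypothetical inverted index: word => list of record numbers.
my %index = (
    biology   => [ 1, 4 ],
    biometric => [2],
    bio       => [3],
    symbiosis => [5],
);

# Turn a user pattern such as "bio*" into an anchored regex.
# Everything except "*" is escaped, so users never write regex syntax.
sub wildcard_to_regex {
    my ($pattern) = @_;
    my $body = join '[^\s]*', map { quotemeta($_) } split /\*/, $pattern, -1;
    return qr/^$body$/;
}

# Scan every key of the index and collect the records of matching words.
sub partial_lookup {
    my ($pattern) = @_;
    my $re = wildcard_to_regex($pattern);
    my %seen;
    for my $word ( grep { $_ =~ $re } keys %index ) {
        $seen{$_} = 1 for @{ $index{$word} };
    }
    return sort { $a <=> $b } keys %seen;
}

my @hits = partial_lookup('bio*');
print "@hits\n";    # 1 2 3 4
```

Because the regex is anchored at both ends, "bio*" matches "biology" but not "symbiosis". The cost is a full pass over the keys for every partial query.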

      Just a thought in case you haven't already considered this.


        Phrase matching is not the same as "and" matching. It's not enough for two words to both be in the same record; they have to be there next to each other in the correct order. A word list can't do that, although it can be used to qualify records for further checking. I can do partial matching as part of that, although it requires a full scan of the word list. I'm going to try it.
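A minimal sketch of that two-step approach, under made-up assumptions: the sample records, the in-memory word list, and the name phrase_search are all hypothetical. The word list only qualifies candidates; the record text itself decides whether the words are adjacent and in order.

```perl
use strict;
use warnings;

# Hypothetical sample records, keyed by record number.
my %records = (
    1 => 'perl makes text searching easy',
    2 => 'searching text with perl',
    3 => 'text searching in large files',
);

# Word list: word => { record number => 1 }.
my %index;
for my $id ( keys %records ) {
    $index{$_}{$id} = 1 for split ' ', lc( $records{$id} );
}

# Find records containing the words of the phrase adjacent and in order.
sub phrase_search {
    my ($phrase) = @_;
    my @words = split ' ', lc($phrase);
    return unless @words;
    $phrase = join ' ', @words;    # normalise case and spacing

    # Step 1: qualify records that contain every word (an "and" match).
    my @candidates = keys %records;
    for my $word (@words) {
        my $ids = $index{$word} || {};
        @candidates = grep { $ids->{$_} } @candidates;
    }

    # Step 2: check the record text itself for the whole phrase.
    # This is a plain substring test, so it does not enforce whole words.
    my @hits = grep { index( lc( $records{$_} ), $phrase ) >= 0 } @candidates;
    return sort { $a <=> $b } @hits;
}

my @hits = phrase_search('text searching');
print "@hits\n";    # 1 3
```

All three records pass step 1, since each contains both words; record 2 is rejected in step 2 because the words appear in the wrong order.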

        Incidentally, I'm using index() instead of m// for partial matching, which should be faster. Giving users regex search capability is not a goal.
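For illustration, an index()-based matcher covering the case-sensitive and whole-words options might look something like this. The function name text_matches and the option names case_sensitive and whole_words are invented for the sketch, and the whole-word test here simply inspects the characters on either side of each hit.

```perl
use strict;
use warnings;

# Return 1 if $needle occurs in $text, using index() rather than m//
# to locate candidates. The option names are hypothetical.
sub text_matches {
    my ( $text, $needle, %opt ) = @_;
    unless ( $opt{case_sensitive} ) {
        $text   = lc($text);
        $needle = lc($needle);
    }
    return 0 if $needle eq '';

    my $pos = 0;
    while ( ( $pos = index( $text, $needle, $pos ) ) >= 0 ) {
        return 1 unless $opt{whole_words};

        # Whole-word check: the characters on either side of the hit
        # must not be word characters.
        my $end    = $pos + length($needle);
        my $before = $pos > 0 ? substr( $text, $pos - 1, 1 ) : '';
        my $after  = $end < length($text) ? substr( $text, $end, 1 ) : '';
        return 1 if $before !~ /\w/ && $after !~ /\w/;

        $pos++;    # keep scanning after a rejected hit
    }
    return 0;
}

print text_matches( 'Biology text', 'bio' ), "\n";                      # 1
print text_matches( 'Biology text', 'bio', case_sensitive => 1 ), "\n"; # 0
print text_matches( 'Biology text', 'bio', whole_words => 1 ), "\n";    # 0
print text_matches( 'biology and bio', 'bio', whole_words => 1 ), "\n"; # 1
```

The loop restarts the scan one character past a rejected hit, so a later whole-word occurrence in the same record is still found.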
