PerlMonks  

Re: Re: speeding up a file-based text search

by perrin (Chancellor)
on May 07, 2003 at 20:15 UTC (#256372)


in reply to Re: speeding up a file-based text search
in thread speeding up a file-based text search

I might be able to do something with a word list. This is actually a medical glossary, so the number of distinct terms may be higher than common documents, but probably not too bad. The situation is complicated by the fact that I need to support phrase searching as well as and/or boolean searching. I could use a word list just to qualify records for further checking. I already have a dbm file for fast random access to the records once I know which ones I want. I'll try that and see how much of a difference it makes.
Re: Re: Re: speeding up a file-based text search
by BrowserUk (Pope) on May 07, 2003 at 20:24 UTC

    Could you show us a few example queries that you wish to support and, specifically, the format in which the queries are defined?


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      Here are the options:

      • query type: phrase, and, or
      • case-sensitive: yes, no
      • whole words only: yes, no
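One way to read these three options is as switches that decide how each search term is compiled into a regex. The following is only a sketch under that assumption: `make_matcher` and the option names `type`, `query`, `case_sensitive`, and `whole_words` are hypothetical, not anything taken from the actual search code.

```perl
use strict;
use warnings;

# Build a matcher (a code ref) from the three options.
# All option names here are hypothetical, chosen for illustration.
sub make_matcher {
    my (%opt) = @_;

    # A phrase is one term; and/or queries are split on whitespace.
    my @terms = $opt{type} eq 'phrase' ? ($opt{query}) : split ' ', $opt{query};

    my @regexes = map {
        my $t = quotemeta($_);                      # treat user input literally
        $t = "\\b$t\\b" if $opt{whole_words};       # whole words only
        $opt{case_sensitive} ? qr/$t/ : qr/$t/i;    # case sensitivity
    } @terms;

    if ($opt{type} eq 'or') {
        # "or": any one term is enough.
        return sub {
            my ($text) = @_;
            for my $re (@regexes) { return 1 if $text =~ $re }
            return 0;
        };
    }

    # "and" and "phrase": every term must match.
    return sub {
        my ($text) = @_;
        for my $re (@regexes) { return 0 unless $text =~ $re }
        return 1;
    };
}

my $match = make_matcher(type => 'and', query => 'renal failure',
                         case_sensitive => 0, whole_words => 1);
print $match->('Acute renal failure') ? "match\n" : "no match\n";
```

Note that `\b` only behaves as a word boundary next to word characters, so terms that start or end with punctuation would need different handling.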

        The reasons you have given for an inverted index not being practical are that

        a) you need to support searching for phrases

        Matching phrases against an index is a case of splitting the phrase into its constituent words, and then intersecting the sets of record numbers that are returned from the index. The intersection gives you the candidate records, which can then be checked for the exact phrase. (See Re: Idea for XPath implementation for a slightly better explanation of this.)

        And, or & not are just extensions of the set manipulations.
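A minimal sketch of the set manipulations described above, assuming the index maps each word to a hash of record numbers. The sample records and the helper names (`ids_for`, `intersect`, `union`, `difference`, `phrase_search`) are made up for illustration; a real version would fetch records from the dbm file instead of an in-memory hash.

```perl
use strict;
use warnings;

# Hypothetical sample records standing in for glossary entries.
my %records = (
    1 => 'acute renal failure',
    2 => 'chronic renal disease',
    3 => 'acute respiratory failure',
);

# Inverted index: word => { record number => 1 }.
my %index;
for my $id (keys %records) {
    $index{ lc($_) }{$id} = 1 for $records{$id} =~ /\w+/g;
}

# Set of record numbers for one word (empty set if the word is unknown).
sub ids_for { my ($word) = @_; return $index{ lc($word) } || {} }

# AND: keep only ids present in every set.
sub intersect {
    my ($first, @rest) = @_;
    my %out = %{ $first || {} };
    for my $set (@rest) {
        delete @out{ grep { !$set->{$_} } keys %out };
    }
    return \%out;
}

# OR: ids present in any set.
sub union {
    my %out;
    $out{$_} = 1 for map { keys %$_ } @_;
    return \%out;
}

# NOT: ids in $keep that are absent from $drop.
sub difference {
    my ($keep, $drop) = @_;
    my %out = map { $_ => 1 } grep { !$drop->{$_} } keys %$keep;
    return \%out;
}

# Phrase: intersect to find candidates, then verify the phrase itself.
sub phrase_search {
    my ($phrase) = @_;
    my @words = $phrase =~ /\w+/g;
    return () unless @words;
    my $candidates = intersect(map { ids_for($_) } @words);
    return sort { $a <=> $b }
           grep { $records{$_} =~ /\b\Q$phrase\E\b/i } keys %$candidates;
}

print join(',', phrase_search('renal failure')), "\n";   # prints 1
```

The index only narrows the field: records 1 and 3 both contain "acute" and "failure", but neither contains the phrase "acute failure", which is why the final check against the record text is still needed.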

        b) you need to support partial matches.

        Partial matches are a bit more complex, but you could use davorg's Tie::Hash::Regex as the basis for your inverted index,

        or use grep /partial.*/, keys %index; (which is what is used under the covers).

        This would probably involve some manipulation of the input query to convert partial matches to regex notation (e.g. bio* => bio[^\s]*), unless your users are comfortable using regex notation.
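As a rough sketch of that conversion, assuming `*` is the only wildcard users type and that the index maps each word to a list of record numbers. The names `wildcard_to_regex` and `partial_lookup` and the sample index are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical index: word => list of record numbers.
my %index = (
    biology    => [1, 4],
    biopsy     => [2],
    antibiotic => [3],
    cardiac    => [5],
);

# Convert a user pattern such as "bio*" into an anchored regex.
# Everything except "*" is quoted so it is matched literally.
sub wildcard_to_regex {
    my ($pattern) = @_;
    my $re = join '[^\s]*', map { quotemeta($_) } split /\*/, $pattern, -1;
    return qr/\A$re\z/i;
}

# Grep the index keys with the regex and collect the record numbers.
sub partial_lookup {
    my ($pattern) = @_;
    my $re = wildcard_to_regex($pattern);
    my %seen;
    return sort { $a <=> $b }
           grep { !$seen{$_}++ }
           map  { @{ $index{$_} } }
           grep { /$re/ } keys %index;
}

print join(',', partial_lookup('bio*')), "\n";    # prints 1,2,4
print join(',', partial_lookup('*bio*')), "\n";   # prints 1,2,3,4
```

This scans every key in the index for each partial term, so its cost grows with the number of distinct words rather than with the size of the records.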

        Just a thought in case you haven't already considered this.


