
Re: speeding up a file-based text search

by Aristotle (Chancellor)
on May 07, 2003 at 01:06 UTC (#256090)

in reply to speeding up a file-based text search

How large would an inverted word list be? It might work better to match against join(' ', @wordlist) (computed only once, obviously). Unless your data is highly random, that string probably won't exceed a few dozen kilobytes for a 20MB document, if it's even that large. Once you find a match, pos, index, rindex and substr let you extract the full word the match landed on, which you can then look up in your inverted word list.
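
A minimal sketch of that idea, not a finished implementation. The sample @wordlist and the helper name find_words are made up for illustration, and the pattern is assumed to contain no whitespace. It matches against the joined string, then uses rindex and index to find the spaces around each match and substr to pull out the whole word.

```perl
use strict;
use warnings;

# Hypothetical sample data; in practice these would be the keys
# of the inverted word list.
my @wordlist = qw(cardiology cardiomyopathy myocarditis nephritis pericarditis);

# Build the search string once.
my $words = join ' ', @wordlist;

# Return every full word in $words that contains a match for $re.
# $re is a compiled pattern (qr//) assumed not to match whitespace.
sub find_words {
    my ($words, $re) = @_;
    my @found;
    while ($words =~ /$re/g) {
        my ($from, $to) = ($-[0], pos($words));   # match start, offset past match end

        # The word starts just after the last space at or before the match...
        my $word_start = rindex($words, ' ', $from) + 1;
        # ...and ends at the next space at or after the match end.
        my $word_end = index($words, ' ', $to);
        $word_end = length $words if $word_end < 0;

        push @found, substr($words, $word_start, $word_end - $word_start);

        # Resume after this word so it is not reported twice.
        pos($words) = $word_end;
    }
    return @found;
}

print join(', ', find_words($words, qr/itis\b/)), "\n";
```

Each word returned this way can then be used as a key into the inverted word list to get the matching records.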

Regex::PreSuf could help make the pattern more efficient if speed is still an issue.

Makeshifts last the longest.

Re: Re: speeding up a file-based text search
by perrin (Chancellor) on May 07, 2003 at 20:15 UTC
    I might be able to do something with a word list. This is actually a medical glossary, so the number of distinct terms may be higher than in typical documents, but probably not too bad. The situation is complicated by the fact that I need to support phrase searching as well as and/or boolean searching. I could use a word list just to qualify records for further checking. I already have a dbm file for fast random access to the records once I know which ones I want. I'll try that and see how much of a difference it makes.
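
A rough sketch of using the word list only to qualify records, under some loud assumptions: the %records hash stands in for the dbm file, and the names %inverted, candidates and phrase_search are invented here. The word list narrows the search to records containing every query word, and only those candidates get the more expensive phrase check.

```perl
use strict;
use warnings;

# Hypothetical records; in the setup described above these would
# be fetched from the dbm file by id.
my %records = (
    1 => 'Acute renal failure is a sudden loss of kidney function.',
    2 => 'Chronic heart failure develops gradually.',
    3 => 'Renal colic is acute pain caused by kidney stones.',
);

# Inverted word list: lowercased word => { record id => 1 }.
my %inverted;
for my $id (keys %records) {
    $inverted{ lc($_) }{$id} = 1 for $records{$id} =~ /\w+/g;
}

# Ids of records that contain every one of the given words.
sub candidates {
    my %uniq;
    $uniq{ lc($_) } = 1 for @_;
    my @words = keys %uniq;

    my %count;
    for my $w (@words) {
        $count{$_}++ for keys %{ $inverted{$w} || {} };
    }
    return sort { $a <=> $b } grep { $count{$_} == @words } keys %count;
}

# Case-insensitive phrase search: qualify by word list, then
# confirm the phrase in the record text itself.
sub phrase_search {
    my ($phrase) = @_;
    my @words = $phrase =~ /\w+/g;
    return grep { index(lc($records{$_}), lc($phrase)) >= 0 } candidates(@words);
}

print join(', ', phrase_search('acute renal')), "\n";
```

An "and" query is just candidates() with no further check. An "or" query would take the union of the per-word id sets instead of the intersection.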

      Could you show us a few example queries that you wish to support, and specifically the format in which the queries are defined?

      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
        Here are the options:

        • query type: phrase, and, or
        • case-sensitive: yes, no
        • whole words only: yes, no
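
One possible way to turn those three options into a matcher, purely as a sketch. The thread does not give the actual query format, so the function name build_matcher, its named parameters, and the whitespace-separated terms for and/or queries are all assumptions. Terms are quoted with quotemeta, so they are treated as literal text rather than as patterns.

```perl
use strict;
use warnings;

# Build a closure that tests one record's text against a query.
# All parameter names here are hypothetical:
#   type           => 'phrase', 'and' or 'or'
#   query          => the search text
#   case_sensitive => true/false
#   whole_words    => true/false
sub build_matcher {
    my (%opt) = @_;
    my $type = $opt{type};

    # A phrase is a single term; and/or queries are split on whitespace.
    my @terms = $type eq 'phrase' ? ($opt{query}) : split ' ', $opt{query};

    my @patterns = map {
        my $re = quotemeta $_;                     # treat the term literally
        $re = "\\b$re\\b" if $opt{whole_words};    # assumes terms start and end with word characters
        $opt{case_sensitive} ? qr/$re/ : qr/$re/i;
    } @terms;

    return sub {
        my ($text) = @_;
        my $hits = grep { $text =~ $_ } @patterns; # number of terms found
        return $type eq 'or' ? $hits > 0 : $hits == @patterns;
    };
}

my $match = build_matcher(
    type => 'and', query => 'renal acute',
    case_sensitive => 0, whole_words => 1,
);
print "matched\n" if $match->('Acute renal failure');
```

A matcher like this would only need to run over records that a word list has already qualified, which keeps the regex work small.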
