PerlMonks  

Re: Re: Offsite Perlmonks Search Engine

by blakem (Monsignor)
on Jul 07, 2002 at 14:39 UTC ( [id://179971]=note )


in reply to Re: Offsite Perlmonks Search Engine
in thread Offsite Perlmonks Search Engine

Earlier this week, someone complained about not being able to search this site for 'AI', so some two-letter words are worth keeping. I do have a very short list of "stopwords" that I can tweak if need be. As far as load on the server... I have no idea... guess I'll find out. ;-)

I could have the "word search" behavior be optional. The current matching (done in the SQL) is similar to /\b$term\b/ but it would be easy enough to let the user turn off those boundary assertions.
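To make the difference between the two modes concrete, here is a minimal sketch in Python rather than SQL. The real matching is done in the database, so this only illustrates the boundary assertions; the function name `build_pattern`, the `whole_word` flag, and the case-insensitive matching are all assumptions made for the example.

```python
import re

def build_pattern(term, whole_word=True):
    """Compile a pattern for one search term.

    whole_word=True mimics /\\b$term\\b/; whole_word=False drops the
    boundary assertions so the term can match inside longer words.
    Case-insensitive matching is an assumption, not taken from the post.
    """
    escaped = re.escape(term)  # treat the term as literal text
    if whole_word:
        return re.compile(r"\b" + escaped + r"\b", re.IGNORECASE)
    return re.compile(escaped, re.IGNORECASE)

if __name__ == "__main__":
    text = "It rained all day"
    # With boundaries, 'ai' does not match inside 'rained'.
    print(bool(build_pattern("ai").search(text)))
    # Without boundaries, it does.
    print(bool(build_pattern("ai", whole_word=False).search(text)))
```

With the boundaries on, a search for 'AI' finds only the standalone word; with them off, it also hits words like "rained", which is the trade-off of a partial word option.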

The "Terms are split on spaces after non-word chars are stripped" is a roundabout way of saying that I'm ignoring quotes. Searching for dogs cats and "perl 6" will get broken down into five terms: dogs, cats, and, perl, 6. '6' gets tossed out because it's too short, and 'and' is one of the stop words, so it is removed as well. That leaves us with dogs, cats, perl and a bunch of bad results. The underscore gives us an easy way out, à la perl_6.
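The term handling described above can be sketched roughly as follows. This is an illustration in Python, not the search engine's actual code: the function names are hypothetical, the stopword list is made up apart from 'and', and the minimum length of 2 is inferred from 'AI' being kept while '6' is dropped.

```python
import re

# Hypothetical stopword list; only 'and' is confirmed by the post.
STOPWORDS = {"and", "the", "of"}
# Assumption: two-letter terms such as 'AI' survive, one-character terms do not.
MIN_LENGTH = 2

def split_terms(query):
    """Strip non-word characters, then split on whitespace.

    The underscore counts as a word character, so perl_6 stays in one piece.
    """
    stripped = re.sub(r"[^\w\s]", "", query)
    return stripped.split()

def filter_terms(terms):
    """Drop terms that are too short or that appear in the stopword list."""
    return [t for t in terms
            if len(t) >= MIN_LENGTH and t.lower() not in STOPWORDS]

if __name__ == "__main__":
    raw = split_terms('dogs cats and "perl 6"')
    print(raw)                # five terms, quotes gone
    print(filter_terms(raw))  # '6' and 'and' removed
    print(filter_terms(split_terms("perl_6")))
```

The quoted phrase loses its quotes in the first step, which is why "perl 6" degrades into a search for plain perl, while perl_6 passes through both steps untouched.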

Thanks for the feedback... I'll probably incorporate the optional "word search" feature in the next rev.

Update: A partial word matching option has now been implemented...

-Blake

Replies are listed 'Best First'.
Re: Re: Re: Offsite Perlmonks Search Engine
by RMGir (Prior) on Jul 07, 2002 at 14:43 UTC
    The problem is that if you turn off word search, you'll be doing something like "LIKE '%searchItem%'", right? I think that might make your load a lot worse...
    --
    Mike
Re: Re: Re: Offsite Perlmonks Search Engine
by Elian (Parson) on Jul 08, 2002 at 01:34 UTC
    If you find that searching gets to be a performance bottleneck, one thing you can do is to build custom indices for the pages in the database. You can build an index for each word in the text, and another with word pairs in the text. (This, for example, would have entries for "if you", "you find", "find that", and so on.) Searching for phrases is just a matter of splitting the phrase into pairs and searching for documents that match all the pairs. (It's generally good enough.)

    You might not have to do this--there are only 180K pages here, so full-text searches may very well not be performance bottlenecks at the moment.
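The word-pair index suggested in this reply could look something like the sketch below. It is an in-memory Python illustration only; a real version would live in database tables, and all the names here (`word_pairs`, `build_pair_index`, `search_phrase`) are hypothetical. Lowercasing the text is also an assumption.

```python
import re
from collections import defaultdict

def word_pairs(text):
    """Return adjacent word pairs: 'if you find' -> (if, you), (you, find)."""
    w = re.findall(r"\w+", text.lower())
    return list(zip(w, w[1:]))

def build_pair_index(docs):
    """Map each word pair to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pair in word_pairs(text):
            index[pair].add(doc_id)
    return index

def search_phrase(index, phrase):
    """Return ids of documents that contain every pair in the phrase.

    A one-word phrase has no pairs and returns an empty set here; it
    would be looked up in the single-word index instead.
    """
    pairs = word_pairs(phrase)
    if not pairs:
        return set()
    result = set(index.get(pairs[0], set()))
    for pair in pairs[1:]:
        result &= index.get(pair, set())
    return result

if __name__ == "__main__":
    docs = {
        1: "If you find that searching gets slow",
        2: "find you that",
        3: "find that you find",
    }
    index = build_pair_index(docs)
    print(search_phrase(index, "you find that"))
```

Document 3 shows why this is only "generally good enough": it contains both pairs ("you find" and "find that") but not the phrase itself, so it comes back as a false positive. A final pass over the candidate documents can weed those out if exact phrase matching matters.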
