PerlMonks  

How to Simply Search a Web Site?

by ajt (Prior)
on Jan 17, 2002 at 21:20 UTC ( [id://139552]=perlquestion )

ajt has asked for the wisdom of the Perl Monks concerning the following question:

I've had a look at Google, CPAN and here via SuperSearch, but "search" is such a popular term that it's hard to find what I want.

I wish to provide users of my site with a simple search function. I will have about 1500 small pages of HTML built from XML, and I'd like to allow users to search them (free-text). The pages will be built via XSLT from XML, so I have the option to dump XML meta-data out to an index file should that be useful. I don't expect a lot of load but you never know...

I have two phases to the project: get a quick demo working within a month, on a much reduced data set, and then get a solid proposal together for a serious long-term solution. The final machine will be an average BSD box, running Apache/mod_perl/MySQL (though this is configurable).

My questions are:

  • Are there any simple scripts lying around that will suffice for the demo? I gather there is an NMS Search, any more?
  • Is it easy to build something from scratch that will work okay for a demo? And where do I look for guidance?
  • Does anyone have estimates of how far I can push a simple script before I need something heavier duty?
  • A nice Perl-based search engine that works in batch mode (on a fast development server overnight, under low load), and then has a quick search CGI that uses that nicely generated index would be useful. Does anyone know of one?

Ideally I'd like a Perl-based solution that costs nothing and works..... Buying a copy of Verity Search isn't an option, and I'd like to avoid paying someone like Google a lot of money if it's not used much. Thoughts, insights and tips warmly welcomed.

As ever, many thanks in advance....

Replies are listed 'Best First'.
(Ovid) Re: How to Simply Search a Web Site?
by Ovid (Cardinal) on Jan 17, 2002 at 21:29 UTC

    You might want to check out Perlfect Search 3.20. I had it up and running after only a few minutes and it's fairly customizable. There are other, more customizable alternatives out there, but I liked it over some of them because you can either walk directly through your documents or use it to spider the Web site via HTTP and thereby pick up your dynamic content, too. Oh, and it's open source :)

    Cheers,
    Ovid

    Join the Perlmonks Setiathome Group or just click on the link and check out our stats.

Re: How to Simply Search a Web Site?
by footpad (Abbot) on Jan 17, 2002 at 21:54 UTC

    Several resources come to mind. For example:

    • merlyn has written some columns on the subject. You may find this and this helpful. (This is interesting, too.)

    • O'Reilly's CGI Programming with Perl offers one starting place. You can find the code online, but the book is well worth the investment.

    • The Perlfect Search engine is quite good.

    • CPAN contains several modules of interest. Consider, for example, this search (alt.)

    • Finally, there are scads of nodes related to the subject. Consider the results of this simple search: site search engine.

    To paraphrase the tag line of a certain TV show, the code is out there.

    --f

Re: How to Simply Search a Web Site?
by ViceRaid (Chaplain) on Jan 17, 2002 at 21:42 UTC

    My first thought is that if you're just demoing it in the first instance, it might be easier just to use a free site-search facility, like the one offered by Google. OK, so you have to suffer some branding, but it's very easy to roll out for prototype purposes.

    I definitely wouldn't try to write something from scratch. The rules that make a search work are surprisingly complex: it might look like a simple =~ m/$searchword/, but any search engine has plenty of things to consider, like:

    • proximity: when searching for more than one word, you might want to rate the result higher where the words are close together.
    • stemming: when searching for "rain", you might want to find results about "raining" and "rainy".
    • weighting: when searching for a word, how do you tell apart documents where the term is only mentioned once, possibly as a cross-reference ('for more information about rain, see "All About Rain"'), from the much more relevant and useful document "All About Rain"?
    • ....
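
To make the stemming and weighting points concrete, here is a minimal sketch in core Perl. The `crude_stem` and `score` subs and the two sample pages are hypothetical names invented for illustration. The suffix stripping is a deliberately naive stand-in for a real stemmer, and the weighting is nothing more than a term count, so treat it as a toy rather than a recommended approach.

```perl
use strict;
use warnings;

# Naive suffix stripping: a stand-in for a real stemmer.
# It only knows three English endings and will mangle many words.
sub crude_stem {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/(?:ing|y|s)$// if length($word) > 4;
    return $word;
}

# Weight a document by how many of its words share a stem with the term.
sub score {
    my ($term, $text) = @_;
    my $stem  = crude_stem($term);
    my $count = 0;
    for my $word ($text =~ /(\w+)/g) {
        $count++ if crude_stem($word) eq $stem;
    }
    return $count;
}

# Hypothetical sample pages: one about rain, one that only cross-references it.
my %docs = (
    'all_about_rain.html' => 'Rain forms when droplets grow. Rainy days follow; it was raining all week.',
    'weather_index.html'  => 'For more information about rain, see All About Rain.',
);

# Print each page with its score for the search term "rain".
printf "%-22s %d\n", $_, score('rain', $docs{$_}) for sort keys %docs;
```

Even this crude count ranks the page that is actually about rain above the page that merely points to it, but it says nothing about proximity, and a plain count still rewards long documents unfairly.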

    For a good, free, open implementation of searching, consider ht://Dig, which isn't in Perl, but is GPL'd and widely used in academic environments. It can index an HTML site and search its index fast and effectively.

    If you're really keen to do it yourself in Perl (it's an interesting project to do), there are plenty of modules on CPAN which might be useful, particularly under Text:: (e.g. Text::Query), and possibly Lingua:: (e.g. Lingua::Stem). Since you've got the site source in XML, it should make generating an index of meaningful content much easier.
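
For anyone who does go the do-it-yourself route, the batch-index-then-search split asked about in the question might look roughly like the sketch below. It uses only core modules (Storable and File::Temp). The sub names `build_index` and `search_index`, the file name `site.idx`, and the sample pages are all assumptions made up for this example, and the page text is assumed to be already stripped of markup. It does no stemming, no proximity ranking and no file locking.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

# Batch step: build a word => { page => count } inverted index.
sub build_index {
    my (%pages) = @_;
    my %index;
    for my $page (keys %pages) {
        for my $word ($pages{$page} =~ /(\w+)/g) {
            $index{ lc $word }{$page}++;
        }
    }
    return \%index;
}

# Search step: pages containing every query word, highest total count first.
sub search_index {
    my ($index, $query) = @_;
    my %seen;
    my @words = grep { !$seen{$_}++ } map { lc } $query =~ /(\w+)/g;
    return () unless @words;

    my (%score, %matched);
    for my $word (@words) {
        my $hits = $index->{$word} or next;
        for my $page (keys %$hits) {
            $score{$page} += $hits->{$page};
            $matched{$page}++;
        }
    }
    # Keep only pages that matched all words, then rank them.
    my @pages  = grep { $matched{$_} == @words } keys %score;
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b } @pages;
    return @ranked;
}

# Hypothetical page text, standing in for content extracted from the XML source.
my %pages = (
    'rain.html'    => 'All about rain: rain gauges, rain clouds and rain shadows.',
    'weather.html' => 'Weather overview. For rain, see the rain page.',
    'sun.html'     => 'All about sunshine.',
);

my $file = tempdir(CLEANUP => 1) . '/site.idx';
nstore(build_index(%pages), $file);   # what an overnight batch job would do
my $index = retrieve($file);          # what the search CGI would load

print join(', ', search_index($index, 'rain')), "\n";
```

The point of the split is that the expensive pass over every page happens once, offline, while each search is only a few hash lookups against the stored index. Whether a single Storable file remains adequate beyond a couple of thousand small pages is something to measure rather than assume.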

    A

    it's raining here

Re: How to Simply Search a Web Site?
by dhable (Monk) on Jan 17, 2002 at 23:13 UTC
    bignosebird.com has a few search engines written in Perl listed. I deployed the Xavatoria search engine on a company site. It was fast, but we had to set up a cron job to reindex the site every so often. If your pages don't change much, that might be a viable solution.
Re: How to Simply Search a Web Site?
by George_Sherston (Vicar) on Jan 18, 2002 at 02:31 UTC
Re: How to Simply Search a Web Site?
by cLive ;-) (Prior) on Jan 18, 2002 at 11:58 UTC
    For quick and dirty, I'd send a request to Google, restricted to the specific site, i.e.:

    +site:domain.com +searchterm

    If you don't want the Google branding, use LWP::UserAgent and strip out the results (this may not *technically* be legal though, so be wary).
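
As a rough illustration of the site-restricted query, the sketch below only builds the search URL and fetches nothing. The subs `url_encode` and `site_search_url` are hypothetical helpers, the `https://www.google.com/search?q=` form is an assumption about the current search URL, and the query drops the leading `+` signs shown above. The hand-rolled encoder assumes plain ASCII input and exists only to keep the example dependency-free.

```perl
use strict;
use warnings;

# Percent-encode everything outside the URL "unreserved" character set.
# Assumes ASCII input.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Build a search URL restricted to one site (the URL prefix is an assumption).
sub site_search_url {
    my ($site, @terms) = @_;
    my $query = join ' ', "site:$site", @terms;
    return 'https://www.google.com/search?q=' . url_encode($query);
}

print site_search_url('domain.com', 'searchterm'), "\n";
```

Pointing the search form at a URL like this keeps the Google branding but avoids the scraping concern raised above.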

    cLive ;-)
