PerlMonks  

Current best practice in Perl internal search engines?

by punch_card_don (Curate)
on Jan 21, 2010 at 02:15 UTC ( [id://818613]=perlquestion )

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Maudlin Monks,

It seems that every 3 or 4 years the time comes to review the client's search engine and bring it up to date.

Long ago we ran the old Perlfect search, which was pretty good. Then we moved to a home-grown engine. Time to update it again.....

This is a mid-sized commercial site. A few hundred HTML / SHTML pages; a few thousand PDFs; a smattering of Word and Excel documents. Really low traffic - we're talking tens of searches a day, not thousands.

So we're looking for the best of modern search functionality - especially good relevancy ordering. And, of course, in Perl. Something off-the-shelf-and-customizable would be an ideal way to avoid joining the Google-Search crowd.

Is there anything new under the sun?

Thanks.




Time flies like an arrow. Fruit flies like a banana.

Replies are listed 'Best First'.
Re: Current best practice in Perl internal search engines?
by Your Mother (Archbishop) on Jan 21, 2010 at 02:27 UTC

    KinoSearch is fantastic and fairly easy to use. It has term frequency–inverse document frequency weighting and query grouping and logic modifiers out of the box. The author is a prince among developers. Use the dev branch (0.3). It's the future and it works quite well.
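For readers unfamiliar with the term, TF-IDF weighting can be sketched in a few lines of plain Perl. This only illustrates the general idea and is not KinoSearch's actual scoring code; the function name `tf_idf` and the sample documents are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes (doc_id => text, ...) and returns
# { doc_id => { term => weight } }, where
# weight = (count of term in doc) * log(N / number of docs containing term).
sub tf_idf {
    my (%docs) = @_;
    my (%tf, %df);
    for my $id (keys %docs) {
        # Lowercase, then split on anything that is not a word character.
        $tf{$id}{$_}++ for grep { length } split /\W+/, lc $docs{$id};
        # Count each term once per document for the document frequency.
        $df{$_}++ for keys %{ $tf{$id} };
    }
    my $n = keys %docs;    # total number of documents
    my %weight;
    for my $id (keys %tf) {
        for my $term (keys %{ $tf{$id} }) {
            $weight{$id}{$term} = $tf{$id}{$term} * log( $n / $df{$term} );
        }
    }
    return \%weight;
}

my $w = tf_idf(
    a => 'perl search engine',
    b => 'perl pdf parser',
    c => 'perl perl perl',
);

# "perl" occurs in every document, so its weight is zero everywhere;
# terms found in a single document get the largest weights.
printf "%-8s %.3f\n", $_, $w->{a}{$_} for sort keys %{ $w->{a} };
```

The point of the weighting is that a word appearing in every document tells you nothing about relevance, however often it is repeated, while a rare word counts for a lot.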

Re: Current best practice in Perl internal search engines?
by Khen1950fx (Canon) on Jan 21, 2010 at 09:08 UTC
    If I only had tens of searches, I'd go with something quick and easy like Search::VectorSpace. The module's author wrote an article for perl.com about building a vector-space search engine. Here's the search engine given by the author:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Search::VectorSpace;

    # Placeholder from the article: supply your own list of document strings.
    my @docs = get_documents_from_somewhere();

    my $engine = Search::VectorSpace->new( docs => \@docs );
    $engine->build_index();
    $engine->set_threshold(0.8);    # minimum relevance for a result to be returned

    # Read one query per line and print the matches, most relevant first.
    while ( my $query = <> ) {
        my %results = $engine->search($query);
        foreach my $result ( sort { $results{$b} <=> $results{$a} } keys %results ) {
            print "Relevance: ", $results{$result}, "\n";
            print $result, "\n\n";
        }
        print "Next query?\n";
    }
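For anyone curious what a vector-space engine does under the hood, here is a minimal sketch using raw term counts and cosine similarity. It illustrates the general technique and is not the code inside Search::VectorSpace; the helper names (`term_vector`, `cosine`, `search`) and the sample documents are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a string into a sparse term-count vector: { term => count }.
sub term_vector {
    my ($text) = @_;
    my %v;
    $v{$_}++ for grep { length } split /\W+/, lc $text;
    return \%v;
}

# Cosine of the angle between two sparse vectors
# (0 = no terms in common, 1 = same direction).
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    $dot += $x->{$_} * ( $y->{$_} // 0 ) for keys %$x;
    $nx  += $_ ** 2 for values %$x;
    $ny  += $_ ** 2 for values %$y;
    return 0 unless $nx && $ny;    # an empty vector matches nothing
    return $dot / sqrt( $nx * $ny );
}

# Score every document against the query and keep those at or above
# the threshold. Returns (document => score, ...).
sub search {
    my ($query, $threshold, @docs) = @_;
    my $q = term_vector($query);
    my %results;
    for my $doc (@docs) {
        my $score = cosine( $q, term_vector($doc) );
        $results{$doc} = $score if $score >= $threshold;
    }
    return %results;
}

my @docs = ( 'The cat sat on the mat', 'Dogs chase cats', 'Perl search engines' );
my %hits = search( 'perl search', 0.5, @docs );
print "$hits{$_}\t$_\n" for sort { $hits{$b} <=> $hits{$a} } keys %hits;
```

This sketch rebuilds every document vector on each query, which is fine for a toy but is exactly the work a real engine does once, up front, when it builds its index.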
Re: Current best practice in Perl internal search engines?
by karpet (Acolyte) on Jan 21, 2010 at 16:06 UTC
    KinoSearch is good (I use it) but it will require some code writing to get the document aggregation and parsing working, especially if you have non-text (e.g. Word and PDF) docs. Swish-e (http://swish-e.org/) is what powers search.cpan.org (last I knew). It has a binary tool for indexing, and a Perl API (SWISH::API) for searching. There are also out-of-the-box .cgi scripts for searching. Disclaimer: I am a developer for both KinoSearch and Swish-e. :)
Re: Current best practice in Perl internal search engines?
by derby (Abbot) on Jan 21, 2010 at 12:33 UTC

    I try to use Perl for everything, but (there's always a but) when it comes to search, I find setting up a Solr+Tomcat engine the least painful. Then using a combination of LWP and JSON makes using the engine a breeze.

    -derby
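A rough sketch of the client side of such a setup. To keep it runnable with only core modules, HTTP::Tiny and JSON::PP stand in here for LWP and JSON; the host, port, path and the `id` field are assumptions about a local Solr install, and `solr_url` / `solr_docs` are names made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(decode_json);

# Assumed location of a local Solr instance; adjust to your own setup.
my $SOLR = 'http://localhost:8983/solr/select';

# Build the query URL, asking Solr for a JSON response (wt=json).
sub solr_url {
    my ($query, $rows) = @_;
    my $params = HTTP::Tiny->new->www_form_urlencode(
        { q => $query, wt => 'json', rows => $rows // 10 } );
    return "$SOLR?$params";
}

# Pull the list of matching documents out of a Solr JSON response body.
sub solr_docs {
    my ($json) = @_;
    my $data = decode_json($json);
    return @{ $data->{response}{docs} || [] };
}

# Only talk to the server when a query is given on the command line.
if (@ARGV) {
    my $res = HTTP::Tiny->new( timeout => 10 )->get( solr_url("@ARGV") );
    die "Solr request failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    print "$_->{id}\n" for solr_docs( $res->{content} );
}
```

Saved as, say, `solr_search.pl` (a hypothetical filename), it would be run as `perl solr_search.pl some query terms`. Keeping URL building and response parsing in separate subs means both can be checked without a running server.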
