PerlMonks  

HTML::Index module -- what's the story?

by Cody Pendant (Prior)
on Nov 22, 2005 at 00:55 UTC ( [id://510604]=perlquestion )

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I went looking on CPAN for a better small search engine solution than rolling my own, again, and found only http://search.cpan.org/~awrigley/HTML-Index-0.15/lib/HTML/Index.pm.

First of all, it hasn't been updated since 2003, secondly it wouldn't install for some reason, but thirdly, I don't really mind that it didn't install because, looking back, as part of the installation it was installing Lingua::Stem modules for half of Europe. That's way over the top for what I wanted to do.

Are there other modules, or other Perl code, for efficiently indexing a few hundred documents on a website, or is the task of searching websites considered only suitable for things like htdig, which in this case would be something of a sledgehammer to crack a nut?



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Replies are listed 'Best First'.
Re: HTML::Index module -- what's the story?
by cees (Curate) on Nov 22, 2005 at 03:46 UTC

    If you want something easy to use, have a look at CGI::Application::Search. It uses the Swish-e search engine to do the indexing and has a nice set of features. Here is a list straight from the docs:

    • Sub-Classable. Unlike the Perl examples that come with swish-e, this is not a script, and can be customized without modifying the original so that several sites may share the same underlying code.
    • Uses CGI::Application::Plugin::AnyTemplate to allow flexibility in template engine choice (HTML::Template, Template-Toolkit or Petal).
    • Built-in templates to use out of box or as examples for your own templates
    • HiLighted search results
    • HiLighted pages linked from search results
    • AJAX results sent to page without need of a page reload
    • AJAX powered 'auto-suggest' to give the user list of choices available for search
Re: HTML::Index module -- what's the story?
by friedo (Prior) on Nov 22, 2005 at 03:35 UTC
    I hear Swish-e is the bee's knees. It's mostly written in C and it has a Perl API on CPAN.
Re: HTML::Index module -- what's the story?
by creamygoodness (Curate) on Nov 22, 2005 at 04:12 UTC
    You're right that there are few small options for a search engine, but that's because user expectations for the vast majority of applications are difficult to meet with a small scale engine. Are your users really going to stand for non-stemmed searching? You've done this before, so of course I'll take your word for it, but that definitely puts you in the minority.

    All the search engine libraries use Lingua::Stem or Lingua::Stem::Snowball, because it would make zero sense to reinvent that wheel. They only come in one package -- Lingua::Stem installs Lingua::Stem::Snowball -- which is mildly unfortunate because Snowball is XS and you need a C compiler. However, I can testify that it's very difficult to write a search engine which scales well to extremely large document collections in pure Perl.
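To illustrate the point about non-stemmed searching, here is a minimal sketch of a pure-Perl inverted index with exact word matching only. The document names and text are hypothetical, and this is not how HTML::Index or Swish-e work internally; it only shows what a user would see without a stemmer.

```perl
use strict;
use warnings;

# Hypothetical in-memory documents (name => text); a real site would read files.
my %docs = (
    'about.html'   => 'We are indexing a few hundred documents',
    'contact.html' => 'Search the index of this website',
);

# Build an inverted index: lowercased word => { document name => count }.
my %index;
for my $name (keys %docs) {
    $index{ lc $1 }{$name}++ while $docs{$name} =~ /(\w+)/g;
}

# Return matching document names, sorted; exact word match only (no stemming).
sub search {
    my ($word) = @_;
    return sort keys %{ $index{ lc $word } || {} };
}

print "index:    @{[ search('index') ]}\n";
print "indexing: @{[ search('indexing') ]}\n";
```

A search for 'index' finds only contact.html and misses about.html, which contains 'indexing'. A stemmer would reduce both words to a common root before indexing and before lookup, so either query would find both pages.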

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com

      As the author of Lingua::Stem I have to correct this: Lingua::Stem is a pure Perl module collection. That is in fact probably the single largest practical difference between it and Lingua::Stem::Snowball (which is entirely XS based). While Lingua::Stem uses Lingua::Stem::Snowball::Da, Lingua::Stem::Snowball::No and Lingua::Stem::Snowball::Se as 'plugin' components - those modules are standalone pure Perl items that are completely independent of the main Lingua::Stem::Snowball distribution even though they share Lingua::Stem::Snowball's namespace.

      As to the complaint that Lingua::Stem installs unwanted European stemmers, I think that is a matter of perspective: Some Europeans might complain that it installs an unwanted English stemmer ;).

      Distributions like Lingua::Stem and Lingua::Stem::Snowball have multiple user bases by design. They are intended to create standards for implementing the type of module so that there are not dozens of different APIs and namespaces for modules that all basically do the same thing for slightly different audiences. Other than using a small amount of extra disk space, that there are features you don't need for your particular use isn't really an issue as long as their presence doesn't interfere with your use.

Re: HTML::Index module -- what's the story?
by lestrrat (Deacon) on Nov 22, 2005 at 10:19 UTC
    <plug>Senna is quite nice too</plug> (although it was originally designed for Japanese fulltext search, it performs quite nicely for English as well)
Re: HTML::Index module -- what's the story?
by Anonymous Monk on Nov 22, 2005 at 07:38 UTC
    First of all, it hasn't been updated since 2003, secondly it wouldn't install for some reason, but thirdly, I don't really mind that it didn't install because, looking back, as part of the installation it was installing Lingua::Stem modules for half of Europe. That's way over the top for what I wanted to do.
    1) HTML::Index doesn't attempt to install any modules

    2) HTML::Index only requires Lingua::Stem

    3) If you're having problems installing HTML::Index, there is very little hope you'll find any solution to your problem.

      1 & 2) HTML::Index doesn't 'install' anything, but it does require Lingua::Stem, BerkeleyDB, HTML::TreeBuilder, Carp::Assert and Compress::Zlib - which 'cpan' helpfully does try to install.

      3) HTML::Index doesn't appear to pass its own build tests on either 5.8.x or 5.6.x, apparently because the number of build tests it declares is different than the number it actually runs for some unknown reason. I don't think the OP is unreasonable in thinking that that is a significant problem with the distribution.
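For anyone unfamiliar with that kind of failure, here is a minimal sketch of a plan mismatch. It assumes a Test::More-style plan, which may not be exactly what HTML::Index's test suite uses, and the test script itself is made up. Every check passes, yet the run still counts as a failure because the declared count and the actual count disagree.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical test script: the plan declares 3 tests, but only 2 run.
my $script = <<'END';
use Test::More tests => 3;
ok(1, 'first check');
ok(1, 'second check');
END

my ($fh, $path) = tempfile(SUFFIX => '.t', UNLINK => 1);
print {$fh} $script;
close $fh or die "close failed: $!";

# Run it with the same perl; merge STDERR so the diagnostics are captured too.
my $output = `"$^X" "$path" 2>&1`;
my $exit   = $? >> 8;

print $output;
print "exit status: $exit\n";
```

The child script exits with a non-zero status and prints a diagnostic about the planned versus actual test count, which is enough for 'make test' (and therefore 'cpan') to refuse the install without being forced.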

Node Type: perlquestion [id://510604]
Approved by Errto