PerlMonks  

Re: Building a search engine

by Anonymous Monk
on Nov 15, 2003 at 11:53 UTC ( [id://307309]=note )


in reply to Building a search engine

Hello,

This is Matt (mattr) but PM is not letting me log in for some reason today. Hope the server isn't ailing..

Many good comments here. I have a db >1GB and 60 databases (and a few more websites) running under htdig in a mod_perl wrapper. Apparently htdig can also generate parseable output, so you might consider it. You can see my system in Omron's search box. Note: the category match function is my own code, not htdig. I have had to build lots of administrative tools, though there are some contributed scripts which will show you how to use it. I had to hack the source because the crawler obeys robots.txt even when you really don't want it to, which was annoying. The newest version (3.2.0b5, which sounds pretty solid though I am not running it) apparently does phrase matching. Basically, if you are administering your own system you will probably have fun with htdig.

I believe you can install htdig even if you do not have root access (which I assume is your situation, since you cannot install a database).

Namazu is a Perl-based engine with docs in English and Japanese and document converters. However, the documentation is not voluminous, and it makes a huge directory full of indexes, so it is a bit opaque. Might be useful for personal use though. It is not high-performance and only indexes local files.

There is also mifluz, which I have not used and which is in perennial beta status, but it might be interesting.

I'm not going to cover WAIS (Wide Area Information Servers) and glimpse, though someone mentioned them.

Also, if you have MySQL that might be useful, though I have not tested their fulltext search yet. But apparently you are not allowed to run a database(?).

If you are going to program something on your own, you basically want to make some kind of inverted index (though it can become quite involved if you want phrase searching, for example). You may get useful performance from a C/C++ database with a Perl API. Possibly even the InvertedIndex module above may be enough for you. But do consider that there are lots of little things about searching that make programming your own system a major, never-ending project: parsing out HTML headers, weighting of results, different searching and indexing techniques, giving higher priority to certain information sources, field-based searching, runtime memory requirements, logging of searches and referrers, and so on.
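To make the inverted index idea concrete, here is a minimal sketch in plain Perl. It is only an illustration: the toy documents and the sub names build_index and search are made up for this example, it keeps everything in memory, and it does none of the phrase matching, weighting or HTML parsing mentioned above.

```perl
use strict;
use warnings;

# Hypothetical toy corpus: document id => text.
my %docs = (
    1 => 'Perl search engine with an inverted index',
    2 => 'htdig crawls sites and ranks results',
    3 => 'Building an index in Perl',
);

# Build the index: term => { document id => term frequency }.
sub build_index {
    my ($docs) = @_;
    my %index;
    for my $id (keys %$docs) {
        # Lowercase the text and split it into word tokens.
        for my $term (lc($docs->{$id}) =~ /(\w+)/g) {
            $index{$term}{$id}++;
        }
    }
    return \%index;
}

# AND query: ids of documents that contain every term, in numeric order.
sub search {
    my ($index, @terms) = @_;
    my %hits;
    for my $term (map { lc $_ } @terms) {
        # A term that is not in the index means no document can match.
        my $postings = $index->{$term} or return;
        $hits{$_}++ for keys %$postings;
    }
    return sort { $a <=> $b } grep { $hits{$_} == @terms } keys %hits;
}

my $index = build_index(\%docs);
print join(',', search($index, 'perl', 'index')), "\n";
```

The point of the structure is that a query only touches the posting lists of its own terms instead of scanning every document. For anything large you would keep the index on disk (a DBM file, for instance) rather than in a Perl hash.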

I think the htdig-type system is very difficult for non-tech people to administer but it has a lot in it, including several search methods like fuzzy/synonym/homonym/stemming, and exact tweaking of tons of variables, like how much to index and whether to notify the admin of changes. And it does ranking, which is really important.

I've done a number of search engines from tiny to about 1GB in size, and now I've gotten interested in two related areas: NLP (natural language processing) and a technique called faceted metadata search, which uses knowledge of the data's structure to let users navigate and narrow down search results at the same time. With 200,000 files this is important, since unless you know exactly what you are looking for, without a metadata technique you will likely get back either too many results or too few. You can check out Seamark, which has an interesting white paper, and Endeca, which has a good Flash-based wine search demo. I'm just mentioning this since there seem to be some people interested in search engines around here, and if you are building your own, then using structure information can greatly reduce your performance requirements.
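A rough sketch of what faceted narrowing looks like in Perl, assuming each record already carries metadata fields. The records, the field names (color, region) and the subs narrow and facet_counts are all hypothetical; this is not how Seamark or Endeca implement it.

```perl
use strict;
use warnings;

# Hypothetical records; each metadata field acts as a facet.
my @wines = (
    { name => 'A', color => 'red',   region => 'France' },
    { name => 'B', color => 'red',   region => 'Italy'  },
    { name => 'C', color => 'white', region => 'France' },
    { name => 'D', color => 'red',   region => 'France' },
);

# Keep only the records that match every facet => value pair chosen so far.
sub narrow {
    my ($records, %selected) = @_;
    my @kept;
    RECORD: for my $r (@$records) {
        for my $facet (keys %selected) {
            next RECORD
                unless defined $r->{$facet} && $r->{$facet} eq $selected{$facet};
        }
        push @kept, $r;
    }
    return \@kept;
}

# Count the remaining records per value of one facet, so the interface
# can show how many results each further choice would leave.
sub facet_counts {
    my ($records, $facet) = @_;
    my %count;
    $count{ $_->{$facet} }++ for @$records;
    return \%count;
}

my $reds      = narrow(\@wines, color => 'red');
my $by_region = facet_counts($reds, 'region');
print "$_: $by_region->{$_}\n" for sort keys %$by_region;
```

Each selection shrinks the candidate set before any full text matching happens, which is where the reduced performance requirement comes from.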

By the way that article of Damian Conway's on text searching in a vector space is pretty neat. Anyway one of the links at the end of it is old - the Nitle link for semantic searching (which incidentally is the bridge between full text search and metadata search) is here. Why not experiment and tell us what you come up with?
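If you want somewhere to start experimenting, the core of a vector space search is just a cosine between term vectors. The following is my own minimal sketch, not the article's code: the subs vectorize and cosine and the sample texts are made up, and it uses raw term counts with no stop words, stemming or weighting.

```perl
use strict;
use warnings;

# Turn text into a sparse term-frequency vector (hash of term => count).
sub vectorize {
    my ($text) = @_;
    my %vec;
    $vec{$_}++ for lc($text) =~ /(\w+)/g;
    return \%vec;
}

# Cosine of the angle between two sparse vectors; 0 if either is empty.
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    $dot += $x->{$_} * ($y->{$_} || 0) for keys %$x;
    $nx  += $_ ** 2 for values %$x;
    $ny  += $_ ** 2 for values %$y;
    return 0 unless $nx && $ny;
    return $dot / sqrt($nx * $ny);
}

# Hypothetical documents, ranked against a query by similarity.
my %docs = (
    perl  => 'building a search engine in perl',
    htdig => 'htdig crawls and indexes web sites',
);
my $query = vectorize('perl search engine');
my %score = map { $_ => cosine($query, vectorize($docs{$_})) } keys %docs;
printf "%-6s %.3f\n", $_, $score{$_}
    for sort { $score{$b} <=> $score{$a} } keys %score;
```

Documents that share more terms with the query score closer to 1, and documents with no terms in common score 0.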

Matt (mattr -at- telebody /dot/ com)
