Building a search engine

artist has asked for the wisdom of the Perl Monks concerning the following question:


Re: Building a search engine
by valdez (Monsignor) on Nov 13, 2003 at 16:24 UTC

You could start with Adding Search Functionality to Perl Applications, Building a Vector Space Search Engine in Perl and Designing a Search Engine. There are also many nodes here, for example:

Ciao, Valerio


Re: Re: Building a search engine

by benizi (Hermit) on Nov 13, 2003 at 22:26 UTC

Building a Vector Space Search Engine in Perl sprang to mind when I saw this question. ++Valerio

covers many common pitfalls
gives great sample code
has great links to related design/theory


Re: Building a search engine
by PodMaster (Abbot) on Nov 13, 2003 at 16:16 UTC

http://search.cpan.org/~alian/Search-Circa-1.18/ - looks like a drop-in solution
http://search.cpan.org/~awrigley/HTML-Index-0.15/ - looks like a drop-in solution
http://search.cpan.org/~snowhare/Search-InvertedIndex-1.13/ - *
Adding Search Functionality to Perl Applications

update: I actually use HTML::Index to search my perl documentation.

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.


Re: Building a search engine
by Abigail-II (Bishop) on Nov 13, 2003 at 16:13 UTC

Building Google like search engine would excellent

I'm sure it would, but your question feels a bit like you enter a hardware store and say "Hi, I want to build an airplane. Something like a Boeing 747 would be excellent, but anything and everything that has good value would help".

Your question is very open ended, and it could involve a lot of work. You could buy or otherwise get some technology, or you could start a project that will supply material for countless Ph.D. and postdoc students.

But what have you done so far, and what is the direction you want to be going? You should realize that if you want to increase your chance of getting a useful answer, you should ask specific questions instead of open ones, and show what you have done so far.

Abigail


Re: Building a search engine
by meetraz (Hermit) on Nov 13, 2003 at 16:16 UTC

On the other hand, if you don't have access to a database, you could try parsing through all files using File::Find and a regexp... or you could build a text-based index to avoid going through all files.
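The File::Find-plus-regexp approach could look roughly like the sketch below. The helper name search_files and the choice to look only at .html/.htm files are assumptions made for illustration, and every file is re-read on every search, which is exactly the cost a prebuilt index avoids.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the paths of all .html/.htm files under
# $dir whose contents match $pattern (a compiled regexp).
sub search_files {
    my ($dir, $pattern) = @_;
    my @hits;
    find(sub {
        # Inside the callback, $_ is the bare file name.
        return unless -f $_ && /\.html?$/i;
        open my $fh, '<', $_ or return;    # skip unreadable files
        local $/;                          # slurp the whole file
        my $text = <$fh>;
        close $fh;
        push @hits, $File::Find::name
            if defined $text && $text =~ $pattern;
    }, $dir);
    @hits = sort @hits;
    return @hits;
}

# Usage (the directory is an assumption):
#   print "$_\n" for search_files('/var/www/html', qr/search\s+engine/i);
```

This is fine for a few hundred pages; with many files or heavy traffic the text-based index becomes the better choice.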

The answer will really depend on how many HTML files you have, how often they change, how much traffic you get, what kind of searches you want to use, and what kind of performance you need.

Can you provide more information on what you're looking for?


Re: Building a search engine
by Purdy (Hermit) on Nov 13, 2003 at 19:37 UTC

I'm surprised no one's brought up Perlfect - that's what my predecessor set up for our Web site and I haven't had to monkey with it since. He even did some custom work with the indexing script to look within a database for material to index as well, but it looks like you don't even need to worry about that...

Peace,

Jason


Re: Re: Building a search engine

by artist (Parson) on Nov 13, 2003 at 21:58 UTC

Thanks.
artist


Re: Re: Re: Building a search engine

by zakzebrowski (Curate) on Nov 14, 2003 at 13:15 UTC

Do you have access to a second machine? Build the index on one machine, and then scp the necessary files back to the host that runs the web page...? (Not sure if you can do that...)

BTW, (I know that you don't have access to a database, but) someone mentioned above that you could do keyword searching by creating an appropriate interface in MySQL. Additionally, MySQL (and Oracle) have full content / full text search on text / varchar / clob fields. You then just build a content index (exercise left to the student), and then when you do the insert you (should) be able to do a full text search on that table. (You may need to "rebuild" an index to get it to work, but again, it's left as an exercise to the student.) The basic idea is to have a "CONTAINS" clause, which specifies that if the document contains the following words, it brings back a 'match score' for each document... Google search result: free text php/mysql tutorial
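For the MySQL case, the clause is spelled MATCH ... AGAINST (CONTAINS is the Oracle Text spelling). Below is a minimal sketch of building such a query; the table name pages and the columns id and body are hypothetical, and it assumes a FULLTEXT index already exists on the searched column.

```perl
use strict;
use warnings;

# Build a MySQL full-text query plus its bind values.
# $table and $column are interpolated, not bound, so they must be
# trusted identifiers. The "id" column is an assumption.
sub fulltext_query {
    my ($table, $column, @words) = @_;
    my $terms = join ' ', @words;
    my $sql = "SELECT id, MATCH($column) AGAINST (?) AS score "
            . "FROM $table WHERE MATCH($column) AGAINST (?) "
            . "ORDER BY score DESC";
    return ($sql, $terms, $terms);
}

# With DBI this would be used roughly as:
#   my ($sql, @bind) = fulltext_query('pages', 'body', 'perl', 'search');
#   my $rows = $dbh->selectall_arrayref($sql, undef, @bind);
```

The score column is the 'match score' mentioned above; rows that do not match at all are filtered out by the WHERE clause.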

----
Zak

undef$/;
$mmm="J\nutsu\nutss\nuts\nutst\nuts A\nutsn\nutso\nutst\nutsh\nutse\nutsr\nuts P\nutse\nutsr\nutsl\nuts H\nutsa\nutsc\nutsk\nutse\nutsr\nuts";
open($DOH,"<",\$mmm);$_=$forbbiden=<$DOH>;s/\nuts//g;print;

Re: Re: Building a search engine

by cfreak (Chaplain) on Nov 13, 2003 at 20:14 UTC

I'll second that. Perlfect is great. It's fast and reliable and very easy to set up.

Lobster Aliens Are attacking the world!


Re: Building a search engine
by jmanning2k (Pilgrim) on Nov 13, 2003 at 19:24 UTC

Glimpse
htDig (shows recent signs of life again!)


Re: Building a search engine
by Anonymous Monk on Nov 13, 2003 at 20:43 UTC

1. How to index a site? (Should the index be as compact as possible? Or maybe several different indices for different portions of the site for faster access?)
2. How to retrieve the user's query and return proper results from the index? (Do we offer phrase searching? Just keyword searching? Is the relevancy determined by keyword frequency or something else? Are the files all HTML?)
3. How to return the results to the user? (Display the page title and URL? What additional information needs to be displayed, like an excerpt from the page?)

Perhaps #3 is the easiest portion provided you have the right index generated, but reading the tutorials and coming up with a more concrete definition of the problem you're trying to solve should help.
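As a rough illustration of the third step, here is one way to turn a matching page into a title, URL and excerpt. The name format_result and the 40-character context window are made up for this sketch, and the tag stripping is deliberately crude; a real HTML parser would be more robust.

```perl
use strict;
use warnings;

# Hypothetical helper: build one result entry for a page that matched $term.
sub format_result {
    my ($url, $html, $term) = @_;

    # Page title, falling back to the URL when there is none.
    my ($title) = $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is;
    $title = $url unless defined $title && length $title;

    # Crude tag stripping; good enough for a sketch only.
    (my $text = $html) =~ s/<[^>]+>/ /g;
    $text =~ s/\s+/ /g;

    # Up to 40 characters of context on each side of the first hit.
    my $excerpt = '';
    if ($text =~ /(.{0,40})(\Q$term\E)(.{0,40})/i) {
        $excerpt = "...$1$2$3...";
    }
    return { title => $title, url => $url, excerpt => $excerpt };
}
```

Each hash it returns carries everything a results template would need for one line of output.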

Also, if Google has your site indexed in its entirety and frequently crawls it, it's not necessary to use their Web form freebie. You can always use the Google API for full-blown searches (although that would limit you to 1,000 searches per day).


Re: Building a search engine
by inman (Curate) on Nov 14, 2003 at 10:00 UTC

Be Spider Friendly!

Arggh! Building (or even running) a search engine is going to be a big effort and involve a fairly large amount of re-inventing of wheels. I work with one every day and keeping it up to date and working is a chore.

The quick and simple answer is that your best strategy will be to make your web site search engine friendly and then to piggyback off one of the commercial search engines (you don't have to use Google). A site is search engine friendly if the search engine spider has an easy time of getting around and doesn't have to follow too many deep links. Strangely enough, a site that is search spider friendly is also accessible by people who are visually disabled and using a text reader.

If you use an internet search engine, there is normally a search option that allows you to submit a search to an internet search engine such that it is limited to one domain (your website). If you have specific requirements, like wanting to ensure full coverage or having your own customised UI, you can pay a commercial search engine to index your site. My employer's public website uses Atomz which seems to work well but is probably expensive.
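With Google, for instance, the domain restriction is the site: operator in the query string. A small sketch of building such a link follows; the function name is made up, other engines use different URLs and operators, and the escaping assumes plain ASCII input.

```perl
use strict;
use warnings;

# Hypothetical helper: build a Google query URL restricted to one domain.
sub site_search_url {
    my ($domain, $query) = @_;
    my $q = "$query site:$domain";
    # Percent-encode everything except unreserved characters.
    # Assumes ASCII input; wide characters would need UTF-8 encoding first.
    $q =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return "https://www.google.com/search?q=$q";
}

# Usage:
#   print site_search_url('example.com', 'perl search'), "\n";
```

A plain HTML form on your own site can send visitors to a URL built this way, so nothing has to be indexed locally.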

The effort that you put into making your site spider friendly will pay dividends regardless of whether you implement a search engine yourself or get someone else to do it. The following links may be useful:

http://www.searchenginewatch.com/
http://www.searchtools.com/index.html - Notice the link to Perl based solutions

inman


Re: Building a search engine
by danb (Friar) on Nov 14, 2003 at 17:19 UTC

Mike Heins added some nice search functionality to Interchange using Swish. Have you looked at Swish already?

-Dan


Re: Building a search engine
by Anonymous Monk on Nov 15, 2003 at 11:53 UTC

This is Matt (mattr) but PM is not letting me log in for some reason today. Hope the server isn't ailing..

Many good comments here. I have a db >1GB and 60 databases (and a few more websites) running under htdig in a mod_perl wrapper, and you can generate parseable data apparently, so you might consider it. You can see my system in Omron's search box. Note: the category match function is my own code, not htdig. I have had to build lots of administrative tools though there are some contributed scripts which will show you how to use it. I had to hack source because the crawler listens to robots.txt even if you really don't want it to do so, which was annoying. The newest version (3.2.0b5 which sounds pretty solid though I am not running it) does phrase matching apparently. Basically if you are administering your own system you probably will have fun with htdig.

I believe you can install htdig even if you do not have root access (as I would assume since you cannot install a database).

Namazu is a perl-based engine with docs in English and Japanese and document converters. However the documentation is not voluminous, and it makes a huge directory full of indexes so is a bit opaque. Might be useful for personal use though. It is not high-performance and only indexes local files.

There is also mifluz which I have not used and is in a perennially beta status but might be interesting.

I'm not going to cover WAIS (Wide Area Information Search) and glimpse though someone mentioned it.

Also if you have MySQL that might be useful, though I have not tested their fulltext search yet. And you are not allowed to run a database apparently(?).

If you are going to program something on your own, you basically want to make some kind of inverted index (it can become quite involved though if you want phrase searching for example). But you may achieve some useful performance using a C/C++ database with a Perl API. Possibly even the InvertedIndex module above may be enough for you. But do consider there are lots of little things about searching that make programming your own system a major neverending project, even just things like parsing out html headers, the weighting of results, different searching and indexing techniques, giving higher priority to certain information sources, field-based searching, runtime memory requirements, logging of searches and referrers, and so on.
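To make the inverted index idea concrete, here is a minimal in-memory sketch. The function names are invented, matching is AND-only keyword search ranked by raw term frequency, and there is no phrase searching, stemming or on-disk storage, which is where the real work starts.

```perl
use strict;
use warnings;

# Build an inverted index: term => { doc_id => occurrences }.
# Takes a list of doc_id => text pairs.
sub build_index {
    my (%docs) = @_;
    my %index;
    for my $id (keys %docs) {
        $index{lc $_}{$id}++ for $docs{$id} =~ /\w+/g;
    }
    return \%index;
}

# Return ids of documents containing every query term,
# highest total term frequency first (ties broken by id).
sub search {
    my ($index, @terms) = @_;
    return unless @terms;
    my (%score, %matched);
    for my $term (map { lc $_ } @terms) {
        my $postings = $index->{$term} or return;   # unknown term: no AND match
        for my $id (keys %$postings) {
            $score{$id} += $postings->{$id};
            $matched{$id}++;
        }
    }
    my @hits = grep { $matched{$_} == @terms } keys %score;
    @hits = sort { $score{$b} <=> $score{$a} || $a cmp $b } @hits;
    return @hits;
}

# Usage:
#   my $idx = build_index('a.html' => 'Perl search engine', 'b.html' => 'cooking');
#   print join(', ', search($idx, 'perl', 'search')), "\n";
```

A lookup only touches the postings for the query terms, never the documents themselves, which is the whole point of the structure.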

I think the htdig-type system is very difficult for non-tech people to administer but it has a lot in it, including several search methods like fuzzy/synonym/homonym/stemming, and exact tweaking of tons of variables, like how much to index and whether to notify the admin of changes. And it does ranking, which is really important.

I've done a number of search engines from tiny to about 1GB size and now I've gotten interested in two related areas: NLP (natural language processing) and a technique called faceted metadata search which uses knowledge of data structure to inform concurrent navigation and narrowing down search results. With 200,000 files this is important, since unless you know exactly what you are looking for, without a metadata technique you will likely either get back too many results or too few. You can check out Seamark which has an interesting white paper and Endeca which has a good flash-based wine search demo. I'm just mentioning this since there seem to be some people interested in search engines around here and if you are building your own, then using structure information can greatly reduce your performance requirements.

By the way that article of Maciej Ceglowski's on text searching in a vector space is pretty neat. Anyway one of the links at the end of it is old - the Nitle link for semantic searching (which incidentally is the bridge between full text search and metadata search) is here. Why not experiment and tell us what you come up with?
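If you do want to experiment, the core of a vector space search fits in a few lines. This is not the article's code, only a sketch with plain hashes as sparse term-frequency vectors and cosine similarity as the score; the function names are made up.

```perl
use strict;
use warnings;

# Turn text into a term-frequency vector (a hash of word => count).
sub make_vector {
    my ($text) = @_;
    my %vec;
    $vec{lc $_}++ for $text =~ /\w+/g;
    return \%vec;
}

# Cosine of the angle between two sparse vectors:
# close to 1 for the same direction, 0 when no terms are shared.
sub cosine {
    my ($u, $v) = @_;
    my $dot = 0;
    $dot += $u->{$_} * ($v->{$_} || 0) for keys %$u;
    my $norm = 1;
    for my $vec ($u, $v) {
        my $sum = 0;
        $sum += $_ ** 2 for values %$vec;
        $norm *= sqrt $sum;
    }
    return $norm ? $dot / $norm : 0;
}

# Usage: score every document against the query and sort by the result.
#   my $q = make_vector('perl search');
#   my $score = cosine($q, make_vector($document_text));
```

Unlike the AND-style keyword match, a document that shares only some of the query words still gets a nonzero score, so results degrade gracefully.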

Matt (mattr -at- telebody /dot/ com)


