PerlMonks  

Building a search engine

by artist (Parson)
on Nov 13, 2003 at 15:57 UTC ( #306819=perlquestion )

artist has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am looking to build or use a search engine for my Internet website. I searched here and on CPAN but found no useful results. (Maybe I am missing the right terms in the CPAN search.) The search will be over HTML files only. The search engine has to search a different set of files for queries coming from different customers. The platform could be Windows or Unix running the Apache server (no mod_perl), if that matters.

Building a Google-like search engine would be excellent, but anything and everything that has good value would help. I don't want to put a Google front-end on my website.

Thus, what I am looking for is an existing search mechanism or CPAN modules that help me build a search engine. Of course, I am open to any other suggestions.

Thank you,
{artist}

Update: Please note that I do not have access to a database for this purpose. Existing files are not changed; files are only added or deleted, on a daily basis. Volume: around 10,000 files.

Update 2 (20031114):

  • Volume: around 200,000 files in total, with 5,000 files added daily. Some are removed at periodic intervals.
  • Perlfect issues: there is no incremental/differential indexing, and it takes a large amount of time to build the index, so recently updated material cannot be searched. The data has to be published ASAP, so I cannot wait for indexing to finish before publishing. If someone has an idea how to incorporate differential indexing into Perlfect, I would really appreciate it.
  • I cannot use Google. The main reason is that I have password-protected sites, and not everybody should be able to see everybody else's content. Think of building a search system for mail accounts.

20031113 Edit by Corion: Fixed unclosed blockquote

Replies are listed 'Best First'.
Re: Building a search engine
by valdez (Monsignor) on Nov 13, 2003 at 16:24 UTC
Re: Building a search engine
by PodMaster (Abbot) on Nov 13, 2003 at 16:16 UTC
Re: Building a search engine
by Abigail-II (Bishop) on Nov 13, 2003 at 16:13 UTC
    Building a Google-like search engine would be excellent

    I'm sure it would, but your question feels a bit like you enter a hardware store and say "Hi, I want to build an airplane. Something like a Boeing 747 would be excellent, but anything and everything that has good value would help".

    Your question is very open-ended, and it could involve a lot of work. You could buy or otherwise get some technology, or you could start a project that will supply material for countless Ph.D. and postdoc students.

    But what have you done so far, and what is the direction you want to be going? You should realize that if you want to increase your chance of getting a useful answer, you should ask specific questions instead of open ones, and show what you have done so far.

    Abigail

Re: Building a search engine
by meetraz (Hermit) on Nov 13, 2003 at 16:16 UTC
    Do you have access to a database? One way to build a search engine would be to write an indexing script that pulls keywords out of your HTML files and puts them into a database, such as MySQL or SQLite. When a user searches for something, you would just use a database query.

    On the other hand, if you don't have access to a database, you could try parsing through all files using File::Find and a regexp... or you could build a text-based index to avoid going through all files.

    The answer will really depend on how many HTML files you have, how often they change, how much traffic you get, what kind of searches you want to use, and what kind of performance you need.

    Can you provide more information on what you're looking for ?
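
The File::Find-and-regexp approach mentioned in this reply could look roughly like the sketch below. It is only an illustration: search_files() is a made-up helper name, tags are stripped with a crude regex rather than a real HTML parser, and every file is read on every query, so it would only be practical for small file sets.

```perl
use strict;
use warnings;
use File::Find;

# Return the paths of all .html/.htm files under $root whose text
# contains $query (case-insensitive, literal match).
# search_files() is a hypothetical helper, not part of File::Find.
sub search_files {
    my ($root, $query) = @_;
    my $pattern = qr/\Q$query\E/i;
    my @hits;
    find(
        {
            no_chdir => 1,    # keep $File::Find::name usable as a path
            wanted   => sub {
                my $path = $File::Find::name;
                return unless -f $path && $path =~ /\.html?\z/i;
                open my $fh, '<', $path or return;    # skip unreadable files
                my $text = do { local $/; <$fh> };    # slurp the whole file
                close $fh;
                return unless defined $text;
                $text =~ s/<[^>]*>/ /g;               # crude tag stripping
                push @hits, $path if $text =~ $pattern;
            },
        },
        $root
    );
    my @sorted = sort @hits;
    return @sorted;
}
```

Called as `my @files = search_files('/path/to/htdocs', 'keyword');`, it returns the matching paths in sorted order. Restricting each customer to their own files then amounts to passing a different $root per customer.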

Re: Building a search engine
by Purdy (Hermit) on Nov 13, 2003 at 19:37 UTC

    I'm surprised no one's brought up Perlfect - that's what my predecessor set up for our Web site, and I haven't had to monkey with it since. He even did some custom work on the indexing script so that it looks within a database for material to index as well, but it looks like you don't even need to worry about that...

    Peace,

    Jason

      Perlfect is good, and I tried it. The problem is updating. With over 100,000 files, adding a single file, or as many as 200 files per day, is a big problem, because I have to re-index everything (i.e. 100,200 files), and re-indexing everything takes a lot of time. Is there any way I can do incremental indexing, or combine two indexes? How do I go about giving Perlfect a list of only the files to index?

      Thanks.
      artist
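
The incremental indexing asked about here can be sketched independently of Perlfect. The code below is not Perlfect's index format or API; it is a minimal stand-alone illustration that keeps a word index in a Storable file and, on each run, touches only files whose modification time changed, plus files that disappeared. The index layout and the helper names (update_index, add_file, remove_file, lookup) are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Find;
use Storable qw(retrieve nstore);

# Index layout (an assumption of this sketch, not Perlfect's format):
#   files => { path => { mtime => ..., words => [ ... ] } }
#   words => { word => { path => occurrence count } }

# Drop every posting that belongs to $path.
sub remove_file {
    my ($index, $path) = @_;
    my $entry = delete $index->{files}{$path} or return;
    for my $word (@{ $entry->{words} }) {
        delete $index->{words}{$word}{$path};
        delete $index->{words}{$word} unless %{ $index->{words}{$word} };
    }
}

# Read one HTML file and add its words to the index. Returns 1 on success.
sub add_file {
    my ($index, $path, $mtime) = @_;
    open my $fh, '<', $path or return 0;
    my $text = do { local $/; <$fh> };
    close $fh;
    $text = '' unless defined $text;
    $text =~ s/<[^>]*>/ /g;                      # crude tag stripping
    my %count;
    $count{ lc $_ }++ for $text =~ /\w+/g;
    $index->{words}{$_}{$path} = $count{$_} for keys %count;
    $index->{files}{$path} = { mtime => $mtime, words => [ keys %count ] };
    return 1;
}

# Bring the stored index up to date; returns the number of files touched.
sub update_index {
    my ($root, $index_file) = @_;
    my $index = -e $index_file ? retrieve($index_file)
                               : { files => {}, words => {} };
    my (%seen, $changed);
    $changed = 0;
    find({ no_chdir => 1, wanted => sub {
        my $path = $File::Find::name;
        return unless -f $path && $path =~ /\.html?\z/i;
        $seen{$path} = 1;
        my $mtime = (stat $path)[9];
        my $old   = $index->{files}{$path};
        return if $old && $old->{mtime} == $mtime;   # unchanged: skip
        remove_file($index, $path) if $old;          # discard stale postings
        $changed += add_file($index, $path, $mtime);
    } }, $root);
    for my $path (grep { !$seen{$_} } keys %{ $index->{files} }) {
        remove_file($index, $path);                  # file was deleted
        $changed++;
    }
    nstore($index, $index_file) if $changed;
    return $changed;
}

# Paths of the files containing $word, sorted.
sub lookup {
    my ($index, $word) = @_;
    my @paths = sort keys %{ $index->{words}{ lc $word } || {} };
    return @paths;
}
```

Run from cron after each batch of new files, update_index() does work proportional to the number of changed files rather than the whole collection, although it still has to stat every file and load the full index into memory. At 200,000 files a single Storable file may become too large, in which case a tied DBM file would be a natural next step.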

        Do you have access to a second machine? Build the index on one machine, and then scp the necessary files back to the host that runs the web page...? (Not sure if you can do that...)

        BTW (I know that you don't have access to a database, but) someone mentioned above that you could do keyword searching by creating an appropriate interface in MySQL. Additionally, MySQL (and Oracle) have full-content / full-text search on text / varchar / clob fields. You just build a content index (exercise left to the student), and then once you have done the insert you should be able to do a full-text search on that table. (You may need to "rebuild" an index to get it to work, but again, that's left as an exercise to the student.) The basic idea is to have a "CONTAINS"-style clause (MATCH ... AGAINST in MySQL, CONTAINS in Oracle Text), which specifies the words a document must contain and brings back a 'match score' for each document... Google search result: free text php/mysql tutorial



        ----
        Zak
        undef$/;$mmm="J\nutsu\nutss\nuts\nutst\nuts A\nutsn\nutso\nutst\nutsh\nutse\nutsr\nuts P\nutse\nutsr\nutsl\nuts H\nutsa\nutsc\nutsk\nutse\nutsr\nuts";open($DOH,"<",\$mmm);$_=$forbbiden=<$DOH>;s/\nuts//g;print;
Re: Building a search engine
by jmanning2k (Pilgrim) on Nov 13, 2003 at 19:24 UTC
    I have had excellent success in the past with both of the following:

    ~J

Re: Building a search engine
by Anonymous Monk on Nov 13, 2003 at 20:43 UTC
    In a nutshell you are trying to find the best solution to several different problems:
    1. How to index a site? (Should the index be as compact as possible? Or maybe several different indices for different portions of the site for faster access?)
    2. How to take a user's query and return the proper results from the index? (Do we offer phrase searching? Just keyword searching? Is the relevancy determined by keyword frequency or something else? Are the files all HTML?)
    3. How to return the results to the user? (Display the page title and URL? What additional information needs to be displayed, like excerpt from the page?)

    Perhaps #3 is the easiest portion provided you have the right index generated, but reading the tutorials and coming up with more concrete definition of a problem you're trying to solve should help.
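
As a rough illustration of points 2 and 3 in the list above, the sketch below scores a document by plain keyword frequency and builds a result record with the page title, URL and a short excerpt. make_result() is a hypothetical helper, the 40-character context width is an arbitrary choice, and the regex-based HTML handling is an assumption that would not survive messy markup.

```perl
use strict;
use warnings;

# Build one search result for an HTML document, or return nothing if the
# keyword does not occur in the body text.
# make_result() is a hypothetical helper for illustration only.
sub make_result {
    my ($url, $html, $keyword, $context) = @_;
    $context = 40 unless defined $context;    # characters kept around the hit

    # Page title, falling back to the URL when there is none.
    my ($title) = $html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is;
    $title = $url unless defined $title && length $title;

    # Plain body text: drop the title element, strip tags, tidy whitespace.
    (my $text = $html) =~ s{<title[^>]*>.*?</title>}{ }is;
    $text =~ s/<[^>]*>/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;

    # Relevancy here is simply the number of whole-word occurrences.
    my $score = () = $text =~ /\b\Q$keyword\E\b/gi;
    return unless $score;

    # Excerpt: some context on either side of the first occurrence.
    $text =~ /\b\Q$keyword\E\b/i;
    my $start = $-[0] - $context;
    $start = 0 if $start < 0;
    my $excerpt = substr($text, $start, length($keyword) + 2 * $context);

    return { url => $url, title => $title, score => $score, excerpt => $excerpt };
}
```

A result page would then be something like `sort { $b->{score} <=> $a->{score} } map { make_result($_, $html{$_}, $query) } keys %html`, where %html (URL to page source) is again an assumed structure. Documents without a hit drop out because make_result() returns an empty list for them.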

    Also, if Google has your site indexed in its entirety and crawls it frequently, it's not necessary to use their Web form freebie. You can always use the Google API for full-blown searches (although that would limit you to 1,000 searches per day).

Re: Building a search engine
by inman (Curate) on Nov 14, 2003 at 10:00 UTC
    Be Spider Friendly!

    Arggh! Building (or even running) a search engine is going to be a big effort and involve a fairly large amount of re-inventing of wheels. I work with one every day and keeping it up to date and working is a chore.

    The quick and simple answer is that your best strategy will be to make your web site search engine friendly and then to piggyback off one of the commercial search engines (you don't have to use Google). A site is search engine friendly if the search engine spider has an easy time getting around and doesn't have to follow too many deep links. Strangely enough, a site that is search spider friendly is also accessible to people who are visually disabled and using a text reader.

    Internet search engines normally offer a search option that limits a search to one domain (your website). If you have specific requirements, like wanting to ensure full coverage or having your own customised UI, you can pay a commercial search engine to index your site. My employer's public website uses Atomz, which seems to work well but is probably expensive.

    The effort that you put into making your site spider friendly will pay dividends regardless of whether you implement a search engine yourself or get someone else to do it. The following links may be useful:

    inman

Re: Building a search engine
by danb (Friar) on Nov 14, 2003 at 17:19 UTC

    Mike Heins added some nice search functionality to Interchange using Swish. Have you looked at Swish already?

    -Dan

Re: Building a search engine
by Anonymous Monk on Nov 15, 2003 at 11:53 UTC
    Hello,

    This is Matt (mattr) but PM is not letting me log in for some reason today. Hope the server isn't ailing..

    Many good comments here. I have a db >1GB and 60 databases (and a few more websites) running under htdig in a mod_perl wrapper, and you can generate parseable data apparently, so you might consider it. You can see my system in Omron's search box. Note: the category match function is my own code, not htdig. I have had to build lots of administrative tools though there are some contributed scripts which will show you how to use it. I had to hack source because the crawler listens to robots.txt even if you really don't want it to do so, which was annoying. The newest version (3.2.0b5 which sounds pretty solid though I am not running it) does phrase matching apparently. Basically if you are administering your own system you probably will have fun with htdig.

    I believe you can install htdig even if you do not have root access (as I would assume since you cannot install a database).

    Namazu is a perl-based engine with docs in English and Japanese and document converters. However the documentation is not voluminous, and it makes a huge directory full of indexes so is a bit opaque. Might be useful for personal use though. It is not high-performance and only indexes local files.

    There is also mifluz, which I have not used and which is in a perennially beta status, but it might be interesting.

    I'm not going to cover WAIS (Wide Area Information Servers) and glimpse, though someone mentioned them.

    Also if you have MySQL that might be useful, though I have not tested their fulltext search yet. And you are not allowed to run a database apparently(?).

    If you are going to program something on your own, you basically want to make some kind of inverted index (it can become quite involved though if you want phrase searching for example). But you may achieve some useful performance using a C/C++ database with a Perl API. Possibly even the InvertedIndex module above may be enough for you. But do consider there are lots of little things about searching that make programming your own system a major neverending project, even just things like parsing out html headers, the weighting of results, different searching and indexing techniques, giving higher priority to certain information sources, field-based searching, runtime memory requirements, logging of searches and referrers, and so on.
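
To make the inverted-index idea concrete, here is a minimal in-memory sketch that also supports phrase searching by storing word positions. It is not the InvertedIndex module referred to above, and build_index() and phrase_search() are names invented for this example; a real system would keep the postings on disk.

```perl
use strict;
use warnings;

# Positional inverted index: word => { doc_id => [ word positions ] }.
# Takes a list of doc_id => plain text pairs.
sub build_index {
    my (%docs) = @_;
    my %index;
    for my $id (keys %docs) {
        my $pos = 0;
        push @{ $index{ lc $_ }{$id} }, $pos++ for $docs{$id} =~ /\w+/g;
    }
    return \%index;
}

# Return the sorted ids of the documents that contain the words of
# $phrase consecutively and in order (a single word also works).
sub phrase_search {
    my ($index, $phrase) = @_;
    my @words = map { lc $_ } $phrase =~ /\w+/g;
    return unless @words;
    my $first = $index->{ $words[0] } or return;
    my @hits;
    DOC: for my $id (sort keys %$first) {
        START: for my $start (@{ $first->{$id} }) {
            for my $i (1 .. $#words) {
                my $postings  = $index->{ $words[$i] } or next DOC;
                my $positions = $postings->{$id}       or next DOC;
                # word $i must sit exactly $i positions after the first word
                next START unless grep { $_ == $start + $i } @$positions;
            }
            push @hits, $id;
            next DOC;
        }
    }
    return @hits;
}
```

For example, with documents 1 => 'the quick brown fox' and 2 => 'brown quick fox', searching for 'quick brown' matches only document 1. The position lists are what make the index grow quickly, which is one reason phrase searching makes things "quite involved".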

    I think the htdig-type system is very difficult for non-tech people to administer but it has a lot in it, including several search methods like fuzzy/synonym/homonym/stemming, and exact tweaking of tons of variables, like how much to index and whether to notify the admin of changes. And it does ranking, which is really important.

    I've done a number of search engines from tiny to about 1GB in size, and now I've gotten interested in two related areas: NLP (natural language processing) and a technique called faceted metadata search, which uses knowledge of data structure to inform concurrent navigation and narrowing down of search results. With 200,000 files this is important, since unless you know exactly what you are looking for, without a metadata technique you will likely get back either too many results or too few. You can check out Seamark, which has an interesting white paper, and Endeca, which has a good flash-based wine search demo. I'm just mentioning this since there seem to be some people interested in search engines around here, and if you are building your own, then using structure information can greatly reduce your performance requirements.

    By the way that article of Damian Conway's on text searching in a vector space is pretty neat. Anyway one of the links at the end of it is old - the Nitle link for semantic searching (which incidentally is the bridge between full text search and metadata search) is here. Why not experiment and tell us what you come up with?
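
For anyone who wants to experiment, the core of vector-space searching fits in a few lines. The sketch below is not taken from the article mentioned above; it is a bare-bones version that uses raw term counts as vector components (no stemming, stop words or weighting), and term_vector(), cosine() and rank() are names made up for the example.

```perl
use strict;
use warnings;

# Term-frequency vector for a piece of plain text: word => count.
sub term_vector {
    my ($text) = @_;
    my %tf;
    $tf{ lc $_ }++ for $text =~ /\w+/g;
    return \%tf;
}

# Cosine of the angle between two sparse vectors given as hash refs.
# Returns 0 when either vector is empty.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    $dot += $u->{$_} * ($v->{$_} || 0) for keys %$u;
    $nu  += $_ ** 2 for values %$u;
    $nv  += $_ ** 2 for values %$v;
    return 0 unless $nu && $nv;
    return $dot / sqrt($nu * $nv);
}

# Rank documents (doc_id => text pairs) against a query, best match
# first; documents sharing no word with the query are left out.
sub rank {
    my ($query, %docs) = @_;
    my $q = term_vector($query);
    my %score = map { ($_ => cosine($q, term_vector($docs{$_}))) } keys %docs;
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b }
                 grep { $score{$_} > 0 } keys %score;
    return @ranked;
}
```

In a real system the document vectors would be computed once at indexing time rather than on every query, which is where the indexing problems discussed earlier in the thread come back in.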

    Matt (mattr -at- telebody /dot/ com)
