PerlMonks  

Writing inverted index code in perl might be overkill

by dpavlin (Friar)
on Aug 18, 2005 at 15:52 UTC ( [id://484831]=note )


in reply to Writing a Search Engine in Perl?

I would suggest against writing your own inverted index code. There are a lot of good full-text index engines out there... For a Perl-only implementation, see Plucene or KinoSearch.

For a hybrid C/Perl combination I would suggest http://hyperestraier.sourceforge.net/ for which I plan to write a Perl-only P2P API (help appreciated).

Somewhat off-topic, but there is also (shameless plug) http://pgfoundry.org/projects/pgestraier/ for querying a HyperEstraier index directly from PostgreSQL to get the best of both worlds: structured data in PostgreSQL which is joinable with full-text results from HyperEstraier. It will probably include the P2P API in the near future.


2share!2flame...
Re: Writing inverted index code in perl might be overkill
by eric256 (Parson) on Aug 18, 2005 at 16:25 UTC

    When I saw P2P I started thinking of a massive P2P effort to index the web: use spare processor power from computers around the world to do the indexing. With that kind of power you could build more interesting indexes of documents. I wonder, however, whether it could truly rival Google's or Yahoo's indexers. Kind of like a dmoz.org or something. Just thinking out loud, don't mind me.


    ___________
    Eric Hodges
      The problem that I see with a solution like this is that many would misuse it - depending on what part(s) of the system would be shared. I'm thinking of reverse engineering that could reveal how to be ranked better.
      That's actually possible using the existing horizontal scalability of HyperEstraier.

      Just set up multiple servers which crawl separate parts of the web, and set up search to query all nodes at once.
      An indexer can query the search index to find out whether some other indexer has already crawled a page (and optionally refresh the content if needed). That way, you will have fresher pages with a bigger number of incoming links (which you can count and also use in page ranking - I hope that this idea doesn't violate a Google patent).
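
The coordination described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `%shared_index`, `needs_crawl` and `record_page` are hypothetical names, and the in-memory hash stands in for a query against the search nodes. It does not use the HyperEstraier API.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the shared search index that every
# indexer node can query; a real setup would ask the search nodes instead.
my %shared_index;   # url => { fetched_at => epoch seconds, inlinks => count }

# Should this node crawl $url? Yes if no node has fetched it yet, or if
# the stored copy is older than $max_age seconds.
sub needs_crawl {
    my ($url, $now, $max_age) = @_;
    my $doc = $shared_index{$url};
    return 1 unless $doc && defined $doc->{fetched_at};
    return ($now - $doc->{fetched_at}) > $max_age ? 1 : 0;
}

# Record a crawled page and count one incoming link for every distinct
# page it links to (self-links are ignored).
sub record_page {
    my ($url, $now, @outlinks) = @_;
    $shared_index{$url}{fetched_at} = $now;
    $shared_index{$url}{inlinks} ||= 0;
    my %seen;
    for my $target (grep { $_ ne $url && !$seen{$_}++ } @outlinks) {
        $shared_index{$target}{inlinks}++;
    }
    return;
}
```

Each indexer would call `needs_crawl` before fetching a page and `record_page` afterwards; the `inlinks` counts are then available as one input to ranking.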

      I don't have a pointer to a Perl solution for this (other than CPAN modules, which make every problem 90% done). On the other hand, with the current P2P architecture you can have multiple indexes (for e-mail, documents, etc.) and search over just some or all of them.


      2share!2flame...
Why? - Writing inverted index code in perl might be overkill
by techcode (Hermit) on Aug 18, 2005 at 16:42 UTC
    Eh nice link I made (if that is the right word in English anyway) ...

    Why do you think that writing an inverted index in Perl would be overkill? And in what sense would it be overkill?

    Sure, a C/Perl combination might be considered - I could finally put the C/C++ knowledge I gained at advanced school to practical use ...

      The only downside to a Perl-only version is speed. Of course, it depends on the size of your input data. However, on my laptop I have more data that I want to index than any Perl-only solution can really handle (over 20 GB in various formats).

      I have some experience with WAIT (and some pending patches at http://svn.rot13.org/~dpavlin/svnweb/index.cgi/wait/log/trunk/ ), swish-e, and Xapian (another great engine, which updated its Perl bindings a few days ago). I also experimented with the CLucene Perl bindings and finally ended up with HyperEstraier.

      I would suggest making a list of requirements for the search engine and then selecting the right one. My current list includes:

      • full text search
      • filter results by attributes (e.g. date, category...)
      • ability to update index content while running searches on it
      • wildcard support (or substring, even better!)
      • acceptable speed on projected amount of data
      The last point influences the choice very much. I would go with Plucene if the data size is small enough (or only for prototyping).

      Writing good parsers and analyzers for input formats (do you want to rank bold words higher than the surrounding text?) and a front-end is hard enough without writing your own inverted index implementation, especially since some very good ones already exist.
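
To give a feel for what a home-grown inverted index involves, here is a toy sketch covering three items from the list above: full-text lookup, attribute filtering, and a trailing-`*` prefix wildcard. All names (`%postings`, `%attrs`, `add_doc`, `search`) are made up for illustration. It keeps everything in memory, scores by raw term frequency, and handles neither index updates nor realistic data sizes, which is exactly where the existing engines earn their keep.

```perl
use strict;
use warnings;

my %postings;   # term => { doc_id => term frequency }
my %attrs;      # doc_id => { attribute => value }

# Add a document: split the text into lowercase word terms and count them.
sub add_doc {
    my ($id, $text, %attr) = @_;
    $attrs{$id} = {%attr};
    $postings{$_}{$id}++ for lc($text) =~ /(\w+)/g;
    return;
}

# True if every attribute in %filter matches the document exactly.
sub matches_filter {
    my ($id, %filter) = @_;
    for my $key (keys %filter) {
        my $value = $attrs{$id}{$key};
        return 0 unless defined $value && $value eq $filter{$key};
    }
    return 1;
}

# Search for a single term; a trailing '*' turns it into a prefix wildcard.
# Returns document ids ordered by summed term frequency, highest first.
sub search {
    my ($query, %filter) = @_;
    my $q = lc $query;
    my @terms;
    if ($q =~ /^(.*)\*$/) {
        my $prefix = $1;
        @terms = grep { index($_, $prefix) == 0 } keys %postings;
    }
    else {
        @terms = exists $postings{$q} ? ($q) : ();
    }
    my %score;
    for my $term (@terms) {
        $score{$_} += $postings{$term}{$_} for keys %{ $postings{$term} };
    }
    my @hits   = grep { matches_filter($_, %filter) } keys %score;
    my @sorted = sort { $score{$b} <=> $score{$a} || $a cmp $b } @hits;
    return @sorted;
}
```

For example, `search('sear*', category => 'c')` would return only documents that contain a term starting with "sear" and whose `category` attribute is `c`.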


      2share!2flame...
        In my experience, Plucene was not very good at handling your third requirement, "ability to update index content while running searches on it." The code that handles file locking is prone to die instead of wait. Not good for live websites. The following ASCII depicts my expression upon discovering this:

        8-[

        YMMV
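
For comparison, the "wait instead of die" behaviour can be sketched with plain `flock`. This is a generic pattern, not Plucene's actual locking code; `with_write_lock`, the lock file path and the 100 ms retry interval are assumptions made for the example.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Run $code while holding an exclusive lock on $lockfile. If another
# process holds the lock, retry for up to $timeout seconds instead of
# dying on the first failed attempt.
sub with_write_lock {
    my ($lockfile, $timeout, $code) = @_;
    open(my $fh, '>>', $lockfile) or die "Cannot open $lockfile: $!";
    my $deadline = time() + $timeout;
    until (flock($fh, LOCK_EX | LOCK_NB)) {
        die "Timed out waiting for lock on $lockfile\n" if time() >= $deadline;
        select(undef, undef, undef, 0.1);   # sleep 100 ms, then retry
    }
    my @result = $code->();
    close($fh);   # closing the handle releases the lock
    return @result;
}
```

A writer would wrap its index update as `with_write_lock('/path/to/index.lock', 30, sub { ... })`, so concurrent updates queue up for a bounded time rather than failing outright.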
