PerlMonks  

KinoSearch, or alternatives, on parsing?

by punch_card_don (Curate)
on Mar 14, 2010 at 16:08 UTC ( [id://828587]=perlquestion )

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Mudlark Monks,

So I've been reading about KinoSearch with a view to putting together a new search engine for a website, and it sounds really great. But, unless I've misunderstood, it does not do any document parsing. That is, I will have to write a document parser to pass the content of my files to KinoSearch for building the inverted index.

That was one of the nice things about the old Perlfect Search: it had a crawler and a parser built in. (In fact, sometimes I wonder why I don't just go back to Perlfect...)

Anyway, I've got everything on this site: html, shtml, pdf, doc, xls... Am I really going to have to re-invent the wheel here? Or are there any good multi-format parsers out there?

Or, is there something as good as Kino that includes a parsing module?

Thanks.




Time flies like an arrow. Fruit flies like a banana.

Re: KinoSearch, or alternatives, on parsing?
by r1n0 (Beadle) on Jan 18, 2011 at 17:58 UTC
    Hello,
    I am researching KinoSearch today and came across this post. I have a couple of comments on your question. I am no guru, but this is what I have done with my implementation of KinoSearch:
  • Wrote a script to perform KinoSearch indexing on a file
  • Downloaded and installed Apache Tika
  • Had Tika convert the files to plain-text versions
  • Told the KinoSearch indexing script to pull in the files
  • If a file has a text brother (its Tika-converted text version), the script indexes the text brother instead of the raw file
    It has been very successful so far.
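
The "text brother" selection described in the reply can be sketched roughly as follows. This is only an illustration, not r1n0's actual script: the sibling naming convention (`report.pdf` next to `report.pdf.txt`) and the helper names `text_sibling`, `convert_with_tika`, and `files_to_index` are assumptions made for the example. It is written in Python to keep it self-contained; the same logic would sit in front of whatever KinoSearch indexing script you use. The conversion step assumes a local copy of the Tika application jar and its plain-text output option.

```python
import subprocess
from pathlib import Path

# Assumption for this sketch: the text brother of "report.pdf" is "report.pdf.txt".
TEXT_SUFFIX = ".txt"


def text_sibling(path: Path) -> Path:
    """Return the path where the converted text version of `path` would live."""
    return path.with_name(path.name + TEXT_SUFFIX)


def convert_with_tika(path: Path, tika_jar: Path) -> Path:
    """Ask the Tika app for plain text and save it next to the original.

    Assumes `java` is on PATH and `tika_jar` points at a tika-app jar.
    """
    result = subprocess.run(
        ["java", "-jar", str(tika_jar), "--text", str(path)],
        capture_output=True,
        check=True,
    )
    sibling = text_sibling(path)
    sibling.write_bytes(result.stdout)
    return sibling


def files_to_index(root) -> list:
    """List (original, file_to_read) pairs for every document under `root`.

    If a document has a text brother, the indexer should read the brother;
    otherwise it reads the raw file. Text brothers are not listed as
    documents in their own right.
    """
    pairs = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Skip files that are themselves the text brother of another file.
        if path.suffix == TEXT_SUFFIX:
            original = path.with_name(path.name[: -len(TEXT_SUFFIX)])
            if original.is_file():
                continue
        sibling = text_sibling(path)
        pairs.append((path, sibling if sibling.is_file() else path))
    return pairs
```

A typical run would call `convert_with_tika` once per pdf, doc, or xls file, then hand each `file_to_read` from `files_to_index` to the indexing script while storing the original path as the document's location, so search results still point at the real file.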
