PerlMonks
KinoSearch & Large Documents
by TedYoung (Deacon) on Feb 10, 2007 at 15:06 UTC ( [id://599356]=perlquestion )
TedYoung has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I have a large CMS that I just upgraded from Plucene to KinoSearch. Whenever the site changes, I crawl it and index the various pages and files.

When I try to index a file on the order of 3 MB, the $index->add_doc($doc) call spins the CPU for over 30 minutes before completing. The fields are specced as:
In these cases all of the fields except content are small. Content may be several MB in size. In the one example of this problem, I use wvWare to convert a 3 MB DOC to text ($content). ($indexer is an object that extracts information from some source file.)
At 3 MB, the CPU spins at add_doc for over 30 wallclock minutes before I give up. If I truncate $content with substr, the time drops (almost exponentially): 512 KB takes on the order of 2 minutes. My server is a dual 3 GHz Intel with 1 GB DDR2 RAM.

So, am I doing something wrong? Or am I nuts for trying to index a 3 MB document at a time?

Thanks,

UPDATE: creamygoodness said I should look for use of $&, $`, and $'. A quick search of my codebase revealed that some of my older code (6+ years old now) was using them. At the time, I thought, "Well, if I am going to accept the penalty, I may as well use it!" And I did. I didn't realize the penalty was so severe. In fact, it never really came up until I was trying to tokenize a 3 MB string.

By removing all uses of $&, $`, and $', I took indexing time for an entire site (with a couple of 3 MB docs) down from over an hour to under a minute!!! I will add a warning about this to my note How to build a Search Engine.

This is one of those cases where I wish I could ++ creamygoodness more than once. Maybe we need a new field added to our profiles: "Where to send beer".

Thanks again!

Ted Young

($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)
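
For readers hitting the same problem: on Perls of that era, a single mention of $&, $`, or $' anywhere in a program makes every successful regex match copy the string it matched against, which is what hurts when a tokenizer matches thousands of times against a multi-megabyte string. Below is a minimal sketch of one replacement, using the @- and @+ offset arrays with substr. It is not the code from the CMS in the post; match_parts is a hypothetical helper name and the sample string is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return (before, matched, after) for the first match
# of $re in $text, without ever mentioning $`, $&, or $'.
# $-[0] and $+[0] hold the start and end offsets of the last successful match.
sub match_parts {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    my ($start, $end) = ($-[0], $+[0]);
    return (
        substr($text, 0, $start),               # what $` would have held
        substr($text, $start, $end - $start),   # what $& would have held
        substr($text, $end),                    # what $' would have held
    );
}

my ($pre, $match, $post) = match_parts('index a 3 MB document', qr/\d+\s*MB/);
print "pre=[$pre] match=[$match] post=[$post]\n";
```

Capturing parentheses are another option when only the matched text is needed; they also copy, but only for the regexes that use them rather than for every match in the program.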