PerlMonks  

KinoSearch & Large Documents

by TedYoung (Deacon)
on Feb 10, 2007 at 15:06 UTC ( id://599356 )

TedYoung has asked for the wisdom of the Perl Monks concerning the following question:

Greetings,

I have a large CMS that I just upgraded from Plucene to KinoSearch. Whenever the site changes, I crawl it and index the various pages and files.

When I try to index a file on the order of 3 MB, the $index->add_doc($doc) command spins the CPU for over 30 minutes before completing.

The fields are specified as:

$index->spec_field( name => 'url',      analyzed => 0, vectorized => 0 );
$index->spec_field( name => 'filetype', indexed => 0, analyzed => 0, vectorized => 0 );
$index->spec_field( name => 'title',    boost => 3, vectorized => 0 );
$index->spec_field( name => 'section',  boost => 3, vectorized => 0 );
$index->spec_field( name => 'content' );

In these cases, all of the fields except content are nominal in size; content may be several MB. In the one example of this problem, I use wvWare to convert a 3 MB DOC file to text ($content). ($indexer is an object that extracts information from a source file.)

my $doc = $index->new_doc;
$doc->set_value( url      => $url || '' );
$doc->set_value( title    => $indexer->title || '' );
$doc->set_value( filetype => $indexer->fileType || '' );
$doc->set_value( section  => $indexer->section || '' );
my $content = $indexer->content;
$content =~ s/&nbsp;/ /g;
$content =~ s/\xA0/ /g;
$doc->set_value( content => $content || '' );
$index->add_doc($doc);

At 3 MB, the CPU spins in add_doc for over 30 wall-clock minutes before I give up. If I use substr to shrink $content, the time drops off sharply (almost exponentially): 512 KB takes on the order of 2 minutes.

My server is a dual 3 GHz Intel with 1 GB DDR2 RAM.

So, am I doing something wrong? Or am I nuts for trying to index a 3MB document at a time?

Thanks,


UPDATE:

creamygoodness said I should look for uses of $&, $`, and $'. A quick search against my codebase revealed that some of my older code (6+ years old now) was using them. At the time, I thought, "Well, if I am going to accept the penalty, I may as well use them!" And I did.

I didn't realize the penalty was so severe. In fact, it really never came up until I was trying to tokenize a 3 MB string.

By removing all uses of $&, $`, and $', I took the indexing time for an entire site (with a couple of 3 MB docs) down from over an hour to under a minute!!! I will add a warning about this to my note How to build a Search Engine.
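
For anyone hitting the same wall, the rewrite is mechanical. Here is a minimal sketch (the string and pattern are invented for illustration) of trading $& for a capture group, which returns only the matched part without the process-wide copying penalty:

```perl
use strict;
use warnings;

my $text = "Filed under: KinoSearch";

# Before (penalized): mentioning $& anywhere makes every regex in
# the whole process copy its target string on each match.
#   $text =~ /under: \w+/;
#   my $hit = $&;

# After (clean): a capture group hands back just the matched part.
my ($hit) = $text =~ /(under: \w+)/;
print "$hit\n";    # prints "under: KinoSearch"
```

(On perls 5.10 and later, the /p flag with ${^MATCH} is another penalty-free alternative, and recent perls have reduced the cost of $& substantially; this thread predates both, so capture groups were the available fix.)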

This is one of those cases where I wish I could ++ creamygoodness more than once. Maybe we need a new field added to our profiles: "Where to send beer". Thanks again!

Ted Young

($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

Replies are listed 'Best First'.
Re: KinoSearch & Large Documents
by creamygoodness (Curate) on Feb 11, 2007 at 04:57 UTC

    3MB shouldn't be any trouble at all. Indexing time increases roughly linearly with the length of the text. (Once KinoSearch's internal caches start getting flushed the numbers get noisy, but that happens every 10-15MB of content and it's unlikely to be the problem.)

    Here's a benchmarking result demonstrating the relationship:

    $ perl time_to_add_doc.plx
               s/iter size_4M size_2M size_1M size_500k size_250k
    size_4M      16.4      --    -51%    -77%      -87%      -94%
    size_2M      8.00    105%      --    -53%      -73%      -87%
    size_1M      3.73    340%    114%      --      -41%      -72%
    size_500k    2.20    646%    264%     70%        --      -52%
    size_250k    1.06   1450%    656%    253%      108%       --
    $

    One possibility is that somebody, somewhere has used one of the hateful match variables: $', $`, and $&. Their appearance anywhere in your script or its dependencies will completely destroy the performance of KinoSearch's Tokenizer, which runs a short regex over a large string many times in a tight loop.
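
    A quick way to hunt for them is a recursive scan. This is a rough sketch, not part of KinoSearch; the file extensions and the lib directory are assumptions about the project layout, and a naive pattern like this will also flag the variables inside strings and comments, so treat hits as leads rather than verdicts:

```perl
#!/usr/bin/perl
# Scan a tree of Perl source files for $&, $`, and $'.
use strict;
use warnings;
use File::Find;

find(
    sub {
        return unless -f and /\.(?:pm|pl|plx|t)\z/;
        open my $fh, '<', $_ or return;
        while ( my $line = <$fh> ) {
            # Report file, line number, and the offending line.
            print "$File::Find::name:$.: $line"
                if $line =~ /\$[&`']/;
        }
        close $fh;
    },
    'lib'    # assumed location of the codebase
);
```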

    It's shocking how awful things get, and indeed, the degradation is geometric. Check out Devel::SawAmpersand for an explanation and the sawampersand function which you can use to investigate. Old versions of Text::Balanced are known to cause this problem. (Maybe I should have Tokenizer's constructor issue a warning if Devel::SawAmpersand::sawampersand() returns true.)
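
    The blow-up is easy to reproduce with core Perl alone. The sketch below (sizes and filler text are arbitrary) times a tokenizer-style loop; on a clean perl the times grow roughly linearly with input size, but add a stray "my $x = $&;" anywhere in the file and rerun, and they grow roughly quadratically, because every match then copies the whole string:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# A stand-in for what KinoSearch's Tokenizer does: a short regex
# run over a large string many times in a tight loop.
sub count_tokens {
    my ($text) = @_;
    my $count = 0;
    $count++ while $text =~ /\w+/g;
    return $count;
}

for my $kb ( 256, 512, 1024 ) {
    my $text    = "word " x ( $kb * 1024 / 5 );    # 5 bytes per repeat
    my $started = time;
    my $tokens  = count_tokens($text);
    printf "%5d KB: %7d tokens in %.3fs\n", $kb, $tokens, time - $started;
}
```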

    Another possibility is that you're hitting swap. It sounds like you've got adequate RAM on that box, and KS itself doesn't need a whole lot -- 30MB or so, not factoring in space occupied by the current doc. But it's worth my asking about for the sake of completeness, since the symptoms are consistent with that diagnosis, too.

    If neither of those help, try running the script I used to generate the benchmarking output above. If it produces results in line with mine, that suggests that the problem lies elsewhere in your app.

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com
      You may have hit the nail on the head. Even without Devel::SawAmpersand, a quick search of my 250,000-line codebase found several $&'s. That's bad! But I didn't realize it was that significant. I will update my OP with the results. Thanks.

      Ted Young

      ($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)
Re: KinoSearch & Large Documents
by Khen1950fx (Canon) on Feb 10, 2007 at 16:59 UTC
    I think that you need to "finalize" it:

    $index->finish;

      I do finalize and optimize it after I am done with the indexing. But, in the case of these large docs, I don't even get that far. It is the add_doc method that sucks up the CPU.

      Ted Young

      ($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)
        Humphrey suggests adding a Stopalizer. I haven't tried it yet, but it might help. See:

        Load Problem

Node Type: perlquestion [id://599356]
Approved by Joost
Front-paged by Arunbear