http://qs321.pair.com?node_id=620435


in reply to Re: Running SuperSearch off a fast full-text index.
in thread Running SuperSearch off a fast full-text index.

I did a couple of searches for benchmarks comparing MySQL full text search and KinoSearch / Lucene.

While there wasn't much on KinoSearch (see KinoSearch vs Lucene indexer benchmarks), I did find a comparison between MySQL full text (plus some plugins) and Lucene: see High-Performance-FullText-Search.pdf, which indicates that Lucene is the clear winner in their comparisons.

Also see this PDF for a nice introduction to KinoSearch.

One of the things I like about MySQL full text is its integration into the main database, which makes it easy to add WHERE clauses based on other columns. But reading the benchmarks, this appears to have a significant deleterious effect on performance, so perhaps it isn't such a clever idea after all.

I see that KinoSearch 0.20 has range searches and filters - I'd be interested in knowing what effect these have on performance.

Clint

Re^3: Running SuperSearch off a fast full-text index.
by creamygoodness (Curate) on Jun 11, 2007 at 17:44 UTC

    KinoSearch 0.20's RangeFilters are mostly implemented in C and are optimized for low cost over multiple searches.

    The first time you search with a sort or range constraint on a particular field, there is a performance hit, as a cache has to be loaded. The cache-loading cost can be significant with large indexes, but is only felt once if you are working in a persistent environment (mod_perl, FastCGI) and can keep the Searcher object around for reuse.

    Once the cache is loaded, RangeFilter is extremely fast. There's an initial burst of disk activity as numerical bounds are found, then the rest is all fetching values from the cache and if (locus < lower_bound) C integer comparison stuff -- no matter how many docs match. There's hardly any overhead added above what's required to match the rest of the query.

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com
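To make the description above concrete, here is a minimal sketch in Python of the general technique: a per-field cache that is built once, bounds that are resolved once per search, and a plain integer comparison per matching document. This is not KinoSearch's actual code or API; the function names `build_field_cache` and `range_filter`, and the in-memory lists, are assumptions made purely for illustration.

```python
from bisect import bisect_left, bisect_right


def build_field_cache(doc_values):
    """One-time cost: map each doc id to the ordinal of its field value.

    doc_values[i] is the field value of document i (hypothetical layout).
    """
    sorted_terms = sorted(set(doc_values))
    ordinal_of = {term: i for i, term in enumerate(sorted_terms)}
    doc_ordinals = [ordinal_of[value] for value in doc_values]
    return sorted_terms, doc_ordinals


def range_filter(matching_docs, sorted_terms, doc_ordinals, lower, upper):
    """Keep only the docs whose field value lies in [lower, upper]."""
    # Resolve the bounds to integers once per search.
    lower_bound = bisect_left(sorted_terms, lower)
    upper_bound = bisect_right(sorted_terms, upper) - 1

    hits = []
    for doc_id in matching_docs:
        locus = doc_ordinals[doc_id]  # cached lookup, no disk access
        if locus < lower_bound or locus > upper_bound:
            continue  # integer comparison only
        hits.append(doc_id)
    return hits


dates = ["2007-06-01", "2007-05-15", "2007-06-11", "2007-04-30"]
terms, ordinals = build_field_cache(dates)
print(range_filter([0, 1, 2, 3], terms, ordinals, "2007-05-01", "2007-06-05"))
```

In this sketch the per-document work does not depend on the field values themselves, only on cached integers, which is why the filter adds little on top of matching the rest of the query.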
      The cache-loading can be significant with large indexes, but is only felt once if you are working in a persistent environment (mod_perl, FastCGI)

      Does this mean that for mod_perl running the prefork MPM, each child process needs to load the cache? That must use a lot of memory, no?

      And how do you handle cache updates across all the child processes (whether they're on the same machine or different machines)?

      thanks

      Clint

        Does this mean that for mod_perl running the prefork MPM, each child process needs to load the cache? That must use a lot of memory, no?

        Yes, and KinoSearch is not thread safe. The memory requirements can be significant for large indexes, even though the data structures are not Perl's and attempts have been made to keep things compact.

        And how do you handle cache updates across all the child processes (whether they're on the same machine or different machines)?

        A Searcher instance represents a snapshot of the index in time. Until you manually reload by creating a new Searcher, changes to the index are not visible.

        --
        Marvin Humphrey
        Rectangular Research ― http://www.rectangular.com
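The snapshot behaviour described above can be illustrated with a small Python sketch. It is a toy model, not KinoSearch: the `Index` and `Searcher` classes here are hypothetical stand-ins, and the "index" is just an immutable tuple of strings. The point it shows is that a searcher only sees what existed when it was created, so each long-lived process has to create a new one to pick up changes.

```python
class Index:
    """Toy index: an immutable tuple of documents, replaced on each add."""

    def __init__(self):
        self._docs = ()

    def add(self, doc):
        # Build a new tuple, so snapshots taken earlier are unaffected.
        self._docs = self._docs + (doc,)

    def snapshot(self):
        return self._docs


class Searcher:
    """Toy searcher: captures the index contents at construction time."""

    def __init__(self, index):
        self._docs = index.snapshot()

    def search(self, word):
        return [doc for doc in self._docs if word in doc.split()]


index = Index()
index.add("fast full text index")

searcher = Searcher(index)        # snapshot taken here
index.add("fast range filter")    # later change to the index

print(searcher.search("fast"))          # the old searcher misses the new doc
print(Searcher(index).search("fast"))   # a fresh searcher sees both docs
```

Under this model, a prefork server would need each child to decide for itself when to replace its searcher, for example after a fixed number of requests or when it notices the index has changed; which policy suits a given site is left open here.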