http://qs321.pair.com?node_id=620324


in reply to Running SuperSearch off a fast full-text index.

Or, seeing we already run on MySQl, why not try out the MySQL full text indexes again?

In a conversation with tye in the CB, he said that PM had already tried using MySQL's full text indexes, and that for searches such as 'perl', the DB just locked up.

My own experience with MySQL's full text indexes, is that they are very fast and very reliable, on sites with millions of documents. It may be that the version of MySQL that PM is using (which is 3.xx, if I remember correctly) may be somewhat less reliable, and be responsible for these problems.

I'm assuming that PM has a test suite that we could use to check how a newer version of MySQL works with the existing code.

I'm betting that upgrading to a newer version and adding full text indexes won't be an enormous task, probably easier than implementing Kinosearch (disclaimer: I have no experience of kinosearch)

Clint

Replies are listed 'Best First'.
Re^2: Running SuperSearch off a fast full-text index.
by clinton (Priest) on Jun 11, 2007 at 08:02 UTC
    I did a couple of searches for benchmarks comparing MySQL full text search and KinoSearch / Lucene.

    While there wasn't much on Kinosearch (see KinoSearch vs Lucene indexer benchmarks), I did find this comparison between MySQL full text (plus some plugins) and Lucene - see High-Performance-FullText-Search.pdf, which indicates that Lucene is the clear winner in their comparisons.

    Also see this PDF for a nice introdution to KinoSearch.

    One of the things I like about the MySQL full text, is its integration into the main database, so that adding where clauses based on other columns is easy. But reading the benchmarks, this appears to have a significant deleterious effect on performance, so perhaps it isn't such a clever idea after all.

    I see that KinoSearch 0.20 has range searches and filters - I'd be interested in knowing what effect these have on performance.

    Clint

      KinoSearch 0.20's RangeFilters are mostly implemented in C and are optimized for low cost over multiple searches.

      The first time you search with a sort or range constraint on a particular field, there is a hit as a cache has to be loaded. The cache-loading can be significant with large indexes, but is only felt once if you are working in a persistent environment (mod_perl, FastCGI) and can keep the Searcher object around for reuse.

      Once the cache is loaded, RangeFilter is extremely fast. There's an initial burst of disk activity as numerical bounds are found, then the rest is all fetching values from the cache and if (locus < lower_bound) C integer comparison stuff -- no matter how many docs match. There's hardly any overhead added above what's required to match the rest of the query.

      --
      Marvin Humphrey
      Rectangular Research ― http://www.rectangular.com
        The cache-loading can be significant with large indexes, but is only felt once if you are working in a persistent environment (mod_perl, FastCGI)

        Does this mean that for mod_perl running the prefork MPM, each child process needs to load the cache? That must use a lot of memory, no?

        And how do you handle cache updates across all the child processes (whether they're on the same machine or different machines?

        thanks

        Clint

Re^2: Running SuperSearch off a fast full-text index.
by holli (Abbot) on Jun 10, 2007 at 16:52 UTC
    I'm assuming that PM has a test suite that we could use to check how a newer version of MySQL works with the existing code.
    Assumptions are dangerous :)


    holli, /regexed monk/
      Ahhh, so as far as writing tests goes, its a case of Do as I say, not as I do... :)
Re^2: Running SuperSearch off a fast full-text index.
by dmitri (Priest) on Jun 10, 2007 at 20:19 UTC
    I heard about this feature of MySQL, but I have never used it. How does it compare to KinoSearch in terms of search results relevance?
      MySQL full text search is simple way to make your tables searchable. However, it doesn't really compare well with specialized full-text indexes which always do better job.

      Aside from not quite stellar performance, it has fixed stop word list and depends upon MySQL which is (IMHO!) one moving part too much to index monetary.

      Sole purpose of this project is full-text search. Let's not go into over-engineering problem: we have slow RDBMS searches right now :-)


      2share!2flame...