Re^3: Running SuperSearch off a fast full-text index.

KinoSearch 0.20's RangeFilters are mostly implemented in C and are optimized for low cost over multiple searches.

The first time you search with a sort or range constraint on a particular field, there is a performance hit because a cache has to be loaded. The cache-loading can be significant with large indexes, but is only felt once if you are working in a persistent environment (mod_perl, FastCGI) and can keep the Searcher object around for reuse.

Once the cache is loaded, RangeFilter is extremely fast. There's an initial burst of disk activity as numerical bounds are found, then the rest is all fetching values from the cache and doing C integer comparisons of the form if (locus < lower_bound) -- no matter how many docs match. There's hardly any overhead added above what's required to match the rest of the query.
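
To make the "fetch from the cache, compare integers" step concrete, here is a minimal sketch in C. It is not KinoSearch's actual source: the `FieldCache` struct and both function names are hypothetical, and it assumes the cache is a plain array holding one integer locus (the rank of the field value in sorted order) per document, with the lower and upper bounds already converted to loci.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-field cache: loci[doc_num] is the rank of that
 * document's field value in sorted order. An assumption for this
 * sketch, not KinoSearch's real data layout. */
typedef struct {
    const int32_t *loci;
    size_t         num_docs;
} FieldCache;

/* Return 1 if the document's value lies in [lower_bound, upper_bound]. */
int
range_filter_accepts(const FieldCache *cache, size_t doc_num,
                     int32_t lower_bound, int32_t upper_bound)
{
    assert(cache != NULL && doc_num < cache->num_docs);
    int32_t locus = cache->loci[doc_num];   /* one array fetch */
    if (locus < lower_bound) return 0;      /* below the range */
    if (locus > upper_bound) return 0;      /* above the range */
    return 1;
}

/* Filter the documents matched by the rest of the query, in place.
 * Returns how many document numbers survive. */
size_t
filter_candidates(const FieldCache *cache, size_t *doc_nums, size_t count,
                  int32_t lower_bound, int32_t upper_bound)
{
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (range_filter_accepts(cache, doc_nums[i],
                                 lower_bound, upper_bound)) {
            doc_nums[kept++] = doc_nums[i];
        }
    }
    return kept;
}
```

The per-document cost is one array read and at most two integer comparisons, which is why the filter adds so little on top of matching the query itself.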

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com


Re^4: Running SuperSearch off a fast full-text index. by clinton (Priest) on Jun 11, 2007 at 17:53 UTC
"The cache-loading can be significant with large indexes, but is only felt once if you are working in a persistent environment (mod_perl, FastCGI)"

Does this mean that for mod_perl running the prefork MPM, each child process needs to load the cache? That must use a lot of memory, no?

And how do you handle cache updates across all the child processes (whether they're on the same machine or different machines)?

Thanks,
Clint
Re^5: Running SuperSearch off a fast full-text index. by creamygoodness (Curate) on Jun 11, 2007 at 18:58 UTC
"Does this mean that for mod_perl running the prefork MPM, each child process needs to load the cache? That must use a lot of memory, no?"

Yes, and KinoSearch is not thread safe. The memory requirements can be significant for large indexes, even though the data structures are not Perl's and attempts have been made to keep things compact.

"And how do you handle cache updates across all the child processes (whether they're on the same machine or different machines)?"

A Searcher instance represents a snapshot of the index in time. Until you manually reload by creating a new Searcher, changes to the index are not visible.

--
Marvin Humphrey
Rectangular Research ― http://www.rectangular.com
Re^6: Running SuperSearch off a fast full-text index. by clinton (Priest) on Jun 11, 2007 at 19:09 UTC
So maybe a reasonable solution would be:

- a separate mod_perl search server, which takes search requests from the web server and returns (e.g.) an XML or SOAP list of IDs
- each child process checks (e.g.) a `last_cache_update` file once a minute to decide whether to reload the caches or not

Clint
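
The "check a `last_cache_update` file once a minute" idea could be sketched roughly as follows. This is only an illustration of the throttled mtime check, written in C to match the snippet in the original post; in a mod_perl child the same logic would be a few lines of Perl around `stat`. The `ReloadState` struct and `should_reload` function are hypothetical names. What "reload" means (creating a new Searcher) is left to the caller.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical per-process state for the reload check. */
typedef struct {
    time_t last_check;   /* when this process last looked at the marker */
    time_t seen_mtime;   /* marker mtime as of the last reload */
    time_t interval;     /* minimum seconds between stat() calls */
} ReloadState;

/* Return 1 if the caller should build a fresh Searcher, 0 otherwise.
 * `now` is passed in rather than read from the clock so the logic
 * can be exercised without waiting. */
int
should_reload(ReloadState *state, const char *marker_path, time_t now)
{
    struct stat info;
    assert(state != NULL && marker_path != NULL);

    /* Throttle: look at the file at most once per interval. */
    if (now - state->last_check < state->interval) return 0;
    state->last_check = now;

    /* A missing or unreadable marker is treated as "nothing changed". */
    if (stat(marker_path, &info) != 0) return 0;

    /* Marker was touched since the last reload: remember it and reload. */
    if (info.st_mtime > state->seen_mtime) {
        state->seen_mtime = info.st_mtime;
        return 1;
    }
    return 0;
}
```

Each child keeps its own `ReloadState`, so children reload independently and at most one `stat` call per interval is added to the request path. The indexing process only has to touch the marker file after it finishes an update.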
Re^7: Running SuperSearch off a fast full-text index. by creamygoodness (Curate) on Jun 11, 2007 at 19:23 UTC


	PerlMonks