Search a database for every permutation of a list of words

jfrancis has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Search a database for every permutation of a list of words by ariels (Curate) on Jun 18, 2002 at 07:16 UTC
IF you just want to rank documents in terms of how many query terms (`@term = qw{hairline receding}` in your example) each document contains, you don't need to do any permuting. Note that there are various trade-offs of memory, disk space, and search time that you can make. You'll need to analyze carefully what a "typical" query is like, and what a "bad" query is like, and take appropriate measures. Here's A Way To Do It. I assume you have (or will manufacture) a list of all search terms. Create for each search term a list of the document IDs in which it appears. (Untested, of course...) `my %hit; foreach my $t (@term) { my @doc = $documents_mentioning{$t}; $hit{$_}++ for (@doc); } # Filter (keys %hits) by # of hits (at least 2), then sort. my @res = sort { $hit{$b} <=> $hit{$a} } (grep {$hit{$_} >= 2} keys %hit);` [download] "Interesting" parts include dealing with very large indexes (if your collection of documents is large), stemming words, and selecting the relevant words. A good alternative might be to get a 3rd-party search engine instead of writing your own.	[reply] [d/l] [select]
Re: Search a database for every permutation of a list of words by DamnDirtyApe (Curate) on Jun 18, 2002 at 09:02 UTC
#! /usr/bin/perl use strict ; use Algorithm::Permute ; use Data::Dumper ; my @words_raw = qw/ receding hairline bald / ; # First, check for & remove duplicate words in the list. my @words = () ; { my %wc = () ; foreach ( @words_raw ) { push @words, $_ unless $wc{$_}++ ; } } # Cardinality of powerset of @words = 2n, # where n = \|@words\|. my $cardinality = 2 @words ; my $places = ( log( $cardinality ) / log( 2 ) ) ; # Generate binary templates for all subsets of @words. my $fmt_str = "%0${places}b" ; my @set_templates = map { sprintf $fmt_str, $_ } ( 0 .. $cardinality - 1 ) ; my @power_set = () ; # Replace the template values with the contents of @words. # This will form the power set of @words. foreach ( @set_templates ) { my @flags = split // ; my @temp_set = () ; for ( my $i = 0 ; $i < @flags ; $i++ ) { push @temp_set, $words[$i] if $flags[$i] ; } push @power_set, \@temp_set ; } my @all_iterations = () ; # Drop the empty set from the power set (it's irrelevant # in the context of a search engine. shift @power_set ; # Find the set of possible permutations for each set in # the power set of @words. foreach my $set ( @power_set ) { my $p = new Algorithm::Permute $set ; while ( my @res = $p->next ) { push @all_iterations, join ' ', @res ; } } # Voila! print Dumper( \@all_iterations ) ; [download] _______________ D a m n D i r t y A p e Home Node \| Email	[reply] [d/l]
Re: Search a database for every permutation of a list of words by mpeppler (Vicar) on Jun 18, 2002 at 14:10 UTC
When you say "database", do you mean a SQL database? If so then you need to do some additional work to index the database (unless you happen to have a full text search engine, of course). For a very simple system you can do something like this: `create table words ( doc_id int -- or numeric? , word varchar(32) )` [download] For each document, extract the words (skipping things like "I", "the", etc.), and store each (doc_id. word) combination in the `words` table. Now you can search your documents with something like this: `select d.doc_text, d.doc_id from documents d , words w where w.doc_id = d.doc_id and w.word in (<list of words to match>)` [download] Now you have all the documents that have each of the words that you want to search on - you can then apply additional logic to find phrases (i.e. "receeding hairline") for example. To help rank documents you can expand this by adding either a `count` column and/or a `position` to the `words` table. The position column is the offset of this word in the document, and you can use this to handle "near" queries, and also to rank documents. Michael	[reply] [d/l] [select]
Re: Search a database for every permutation of a list of words by rdfield (Priest) on Jun 18, 2002 at 15:07 UTC
InterMedia from Oracle does all that and more. Ships with the standard product and installs with the starter database. The documentation is a bit woolly to say the least, but there's plenty of informative articles on Metalinks. InterMedia has a pretty decent PL/SQL interface so access via DBI is straightforward. rdfield	[reply]


"be consistent"
	PerlMonks