Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Search a database for every permutation of a list of words

by jfrancis (Initiate)
on Jun 18, 2002 at 06:48 UTC ( [id://175294]=perlquestion: print w/replies, xml ) Need Help??

jfrancis has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a small search engine where search queries will be in the form of questions or statements. For example "I have a receding hairline" or "what should I use for a receding hairline?". I intend for the search query to be filtered such that only the relevant words are left ("receding hairline") and the other irrelevant words used to formulate the question or statement are discarded. What will remain is a list of 1 or more words to be used to retrieve appropriate results from a database of keywords. My question pertains to how I would go about searching the database for every possible permutation of the remaining string of words. For example, if I am left with the string "receding hairline", I need to search the database for "receding", "hairline", "receding hairline" and "hairline receding".
  • Comment on Search a database for every permutation of a list of words

Replies are listed 'Best First'.
Re: Search a database for every permutation of a list of words
by ariels (Curate) on Jun 18, 2002 at 07:16 UTC
    IF you just want to rank documents in terms of how many query terms (@term = qw{hairline receding} in your example) each document contains, you don't need to do any permuting.

    Note that there are various trade-offs of memory, disk space, and search time that you can make. You'll need to analyze carefully what a "typical" query is like, and what a "bad" query is like, and take appropriate measures.

    Here's A Way To Do It. I assume you have (or will manufacture) a list of all search terms.

    • Create for each search term a list of the document IDs in which it appears.
    • (Untested, of course...)
      my %hit; foreach my $t (@term) { my @doc = $documents_mentioning{$t}; $hit{$_}++ for (@doc); } # Filter (keys %hits) by # of hits (at least 2), then sort. my @res = sort { $hit{$b} <=> $hit{$a} } (grep {$hit{$_} >= 2} keys %hit);

    "Interesting" parts include dealing with very large indexes (if your collection of documents is large), stemming words, and selecting the relevant words.

    A good alternative might be to get a 3rd-party search engine instead of writing your own.

Re: Search a database for every permutation of a list of words
by DamnDirtyApe (Curate) on Jun 18, 2002 at 09:02 UTC
    #! /usr/bin/perl use strict ; use Algorithm::Permute ; use Data::Dumper ; my @words_raw = qw/ receding hairline bald / ; # First, check for & remove duplicate words in the list. my @words = () ; { my %wc = () ; foreach ( @words_raw ) { push @words, $_ unless $wc{$_}++ ; } } # Cardinality of powerset of @words = 2**n, # where n = |@words|. my $cardinality = 2 ** @words ; my $places = ( log( $cardinality ) / log( 2 ) ) ; # Generate binary templates for all subsets of @words. my $fmt_str = "%0${places}b" ; my @set_templates = map { sprintf $fmt_str, $_ } ( 0 .. $cardinality - 1 ) ; my @power_set = () ; # Replace the template values with the contents of @words. # This will form the power set of @words. foreach ( @set_templates ) { my @flags = split // ; my @temp_set = () ; for ( my $i = 0 ; $i < @flags ; $i++ ) { push @temp_set, $words[$i] if $flags[$i] ; } push @power_set, \@temp_set ; } my @all_iterations = () ; # Drop the empty set from the power set (it's irrelevant # in the context of a search engine. shift @power_set ; # Find the set of possible permutations for each set in # the power set of @words. foreach my $set ( @power_set ) { my $p = new Algorithm::Permute $set ; while ( my @res = $p->next ) { push @all_iterations, join ' ', @res ; } } # Voila! print Dumper( \@all_iterations ) ;

    _______________
    D a m n D i r t y A p e
    Home Node | Email
Re: Search a database for every permutation of a list of words
by mpeppler (Vicar) on Jun 18, 2002 at 14:10 UTC
    When you say "database", do you mean a SQL database?

    If so then you need to do some additional work to index the database (unless you happen to have a full text search engine, of course).

    For a very simple system you can do something like this:

    create table words ( doc_id int -- or numeric? , word varchar(32) )
    For each document, extract the words (skipping things like "I", "the", etc.), and store each (doc_id. word) combination in the words table.

    Now you can search your documents with something like this:

    select d.doc_text, d.doc_id from documents d , words w where w.doc_id = d.doc_id and w.word in (<list of words to match>)
    Now you have all the documents that have each of the words that you want to search on - you can then apply additional logic to find phrases (i.e. "receeding hairline") for example.

    To help rank documents you can expand this by adding either a count column and/or a position to the words table. The position column is the offset of this word in the document, and you can use this to handle "near" queries, and also to rank documents.

    Michael

Re: Search a database for every permutation of a list of words
by rdfield (Priest) on Jun 18, 2002 at 15:07 UTC
    InterMedia from Oracle does all that and more. Ships with the standard product and installs with the starter database. The documentation is a bit woolly to say the least, but there's plenty of informative articles on Metalinks. InterMedia has a pretty decent PL/SQL interface so access via DBI is straightforward.

    rdfield

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://175294]
Approved by dwiz
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (6)
As of 2024-04-24 03:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found