Re^2: Hash versus substitutation efficiency?

In option one, it searches for all terms and every variation individually. Option two converts all synonyms of terms to their respective root prior to gathering word frequencies. In short, would it be more efficient to search 100,000 hash keys in option one, or use option two to do global substitutions on the documents so as to reduce the hash to only 10,000 keys?

Basically, I have a large set of documents and want to categorize them and be able to explore and search based on terms. Terms would include root terms and their synonyms. For example, "meat" as a root term could mean "ham", "beef", or "chicken". This could be stored in a hash so synonyms all lead to the root term:

$termHash{ "ham" }  = "meat";
$termHash{ "beef" } = "meat";
[download]

In this way, different terms that mean or relate to basically the same thing will be counted all as occurrences of the base term. When searching for "ham", one might see documents that contain "beef" or "meat" as well as "ham".

The numbers are a bit overwhelming:

26 million documents
10,000 base terms
100,000 synonyms

This will run on a cluster where each node works independently on a set of the documents. It's looking like the term hash will be a built separately and stored as either a persistent hash or an xml file.

Comment on Re^2: Hash versus substitutation efficiency? Download Code

Replies are listed 'Best First'.
Re^3: Hash versus substitutation efficiency? by Roy Johnson (Monsignor) on Oct 12, 2004 at 15:51 UTC
If I understand correctly, you want to build a searchable index of your documents, with 10,000 distinct indexable terms, and 100,000 synonyms for those terms. The global substitution seems like a bad way to go, and it's not clear to me how you're going to only do 10,000 subs. It seems like you should have 100,000 keys and 10,000 values in your %termhash. I think what you'd want to do is, given a synonym hash, make a reverse mapping that keys the unique values to a regex that alternates among all the keys. `my %synhash = ( meat => 'meat', ham => 'meat', beef => 'meat'); my %revsynhash; # Collect the synonyms as an array while (my ($k,$v) = each %synhash) { push @{$revsynhash{$v}}, $k; } # Turn the arrays into alternative-lists, longest first while (my $k = each %revsynhash) { $revsynhash{$k} = join '\|', sort {length $b <=> length $a} @{$revsyn +hash{$k}}; } # Now you have # %revsynhash = ( meat => 'beef\|meat\|ham')` [download] So you run your indexing scheme using the 10,000 items in %revsynhash, the keys being the indexed term, while the values are the regexes to count as hits for those terms. When someone wants to look up a term, you use %synhash to translate it, and hit your index for what it translated to. Caution: Contents may have been coded under pressure.	[reply] [d/l]
Re^4: Hash versus substitutation efficiency? by bwelch (Curate) on Oct 12, 2004 at 16:05 UTC
Neat solution, Roy! Thank you!	[reply]


"be consistent"
	PerlMonks