http://qs321.pair.com?node_id=642787

princepawn has asked for the wisdom of the Perl Monks concerning the following question:

Ok, let's say file A has a series of strings, one per line. Let's say that file B has a series of strings, one per line.

The goal is, for each line in A, to return the best match from B using a subroutine named fuzzy_match, a function that takes two strings and returns a float from 0 to 1.

Now, let's assume that file B is enormous, making it infeasible to apply fuzzy_match to every member. But let's also assume that the best fuzzy_match result for any record in A will always share that record's first character. This means that instead of looking through all of B, you simply need to retrieve the records from B which start with the same first letter as the current record in A.

Hence you only search a "bucket" of B instead of all of B, saving you a good bit of time.
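A minimal sketch of this first-character bucketing. The fuzzy_match here is a toy stand-in (common-prefix length over the longer string's length) just so the example runs; any scorer returning 0 to 1 would slot in:

use strict;
use warnings;

# Toy stand-in for fuzzy_match: length of the common prefix divided by
# the length of the longer string (any 0..1 scorer works here).
sub fuzzy_match {
    my ($x, $y) = @_;
    my $min = length($x) < length($y) ? length($x) : length($y);
    my $max = length($x) > length($y) ? length($x) : length($y);
    my $i = 0;
    $i++ while $i < $min && substr($x, $i, 1) eq substr($y, $i, 1);
    return $max ? $i / $max : 1;
}

my @B = qw(apple apricot banana berry cherry);   # stands in for file B
my @A = qw(appel berrry);                        # stands in for file A

# Index B by first character so each query scans only one bucket.
my %bucket;
push @{ $bucket{ lc substr $_, 0, 1 } }, $_ for @B;

my %best_for;
for my $query (@A) {
    my $candidates = $bucket{ lc substr $query, 0, 1 } || [];
    my ($best, $best_score);
    for my $cand (@$candidates) {
        my $score = fuzzy_match($query, $cand);
        ($best, $best_score) = ($cand, $score)
            if !defined $best_score || $score > $best_score;
    }
    $best_for{$query} = $best;
    print "$query -> $best\n";
}

Here "appel" only ever gets compared against the two a-records, never against banana, berry or cherry.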

Now, I could write such an indexing / bucketing routine myself pretty easily, but I'm surprised that such a routine has not already been written. However, CPAN showed no results for such a beast... any leads?

UPDATE

One thing to note is that the assumption about the two data sets should be parameterizable. In the description above, the assumption was that corresponding records share the same first character, but other assumptions are possible.

So, the best interface would be usable under a variety of hashing strategies... the desired interface would be along the lines of:

my $hash = Data::Bucket->new;
$hash->reflect->addSlots(
    hashing_strategy => sub {
        my $self   = shift;
        my $string = shift;
        return substr($string, 0, 2);
    }
);

while (<$search_file>) {
    $hash->store($_);
}

# then later
for (<$in_file>) {
    my @bucket     = $hash->bucket_for($_);
    my $best_match = fuzzy_match($_, @bucket);
    # ... do something with best match ...
}
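Absent a CPAN module, a plain-OO sketch of that interface might look like the following. Data::Bucket::Sketch is a made-up name; the addSlots call above suggests Class::Prototyped, but passing a coderef to the constructor gets the same pluggable-strategy effect:

use strict;
use warnings;

package Data::Bucket::Sketch;    # hypothetical name, not a CPAN module

sub new {
    my ($class, %args) = @_;
    my $self = {
        # Default strategy: first two characters, as in the sketch above.
        hashing_strategy => $args{hashing_strategy}
            || sub { substr $_[0], 0, 2 },
        buckets => {},
    };
    return bless $self, $class;
}

sub store {
    my ($self, $record) = @_;
    push @{ $self->{buckets}{ $self->{hashing_strategy}->($record) } },
        $record;
}

sub bucket_for {
    my ($self, $record) = @_;
    return @{ $self->{buckets}{ $self->{hashing_strategy}->($record) } || [] };
}

package main;

my $index = Data::Bucket::Sketch->new;
$index->store($_) for qw(foo food bar);
print join(' ', $index->bucket_for('fox')), "\n";   # foo and food share 'fo'

Swapping strategies is then just a matter of passing a different coderef, e.g. a Soundex code or a length band.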


Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

Replies are listed 'Best First'.
Re: optimizing a linear search by Indexed or bucketed hashing
by GrandFather (Saint) on Oct 04, 2007 at 23:30 UTC

    Maybe it hasn't been written because the code is fairly trivial, but the details are highly application specific?

    Something like the following would seem to be what you need to do the bucketing:

    use strict;
    use warnings;

    my %buckets;
    my $lineStart = tell DATA;

    while (<DATA>) {
        chomp;
        next unless length;
        push @{$buckets{lc substr $_, 0, 1}}, [$lineStart, tell DATA];
        $lineStart = tell DATA;
    }

    for my $key (sort keys %buckets) {
        my @pairs = map {"@$_"} @{$buckets{$key}};
        print "$key: ", (join ', ', @pairs), "\n";
    }

    __DATA__
    Ok, let's say file A has a series of strings, one per line. Let's say that file B has a series of strings, one per line.
    The goal is, for each line in A, to return the best match from B using a subroutine named fuzzy_match, a function that takes two strings and returns a float from 0 to 1.
    Now, let's assume that file B is enormous, making the prospect of applying fuzzy_match to each member infeasible. But let's also assume that the first character of each member of B will always be the best result from fuzzy_match for A. This means that instead of looking through all of B, you simply need to retrieve all records from B which start with the same first letter as the current record in A.

    Prints:

    b: 458 499
    c: 822 900, 1053 1074
    f: 651 670, 746 822, 900 979
    n: 670 746
    o: 378 458
    r: 979 1053
    s: 573 651
    t: 499 573
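    Once the [start, end] offsets are stored, pulling a bucket back out of the big file is just a seek and a read per record. A small self-contained version of that round trip, using a temp file in place of file B:

    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Write a small file of records (stands in for file B).
    my ($out, $fname) = tempfile(UNLINK => 1);
    print $out "$_\n" for qw(apple banana apricot berry);
    close $out;

    # Index: first character -> list of [start, end] byte offsets.
    open my $in, '<', $fname or die "open: $!";
    my %buckets;
    my $start = tell $in;
    while (<$in>) {
        chomp;
        push @{ $buckets{lc substr $_, 0, 1} }, [$start, tell $in];
        $start = tell $in;
    }

    # Fetch the 'a' bucket by seeking back into the file.
    my @a_records;
    for my $pair (@{ $buckets{a} }) {
        my ($s, $e) = @$pair;
        seek $in, $s, 0 or die "seek: $!";
        read $in, my $buf, $e - $s;
        chomp $buf;
        push @a_records, $buf;
    }
    close $in;
    print "@a_records\n";   # apple apricot

    This keeps only offsets in memory, so the index stays small even when file B is enormous.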

    Perl is environmentally friendly - it saves trees
Re: optimizing a linear search by Indexed or bucketed hashing
by Cop (Initiate) on Oct 05, 2007 at 01:47 UTC

    This is like a simplified B-tree; with a hash you can easily do this yourself.

    If you don't want to do this yourself, put the data in a database and it will do this for you. You should be using a database anyway.
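    For the database route, a sketch with DBI and an in-memory SQLite table (this assumes DBD::SQLite is installed); an index on the bucket column gives the same "only scan one bucket" behaviour:

    use strict;
    use warnings;
    use DBI;

    # In-memory SQLite database (assumes DBD::SQLite is installed).
    my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                           { RaiseError => 1, PrintError => 0 });

    $dbh->do('CREATE TABLE b (record TEXT, first_char TEXT)');
    $dbh->do('CREATE INDEX idx_first_char ON b (first_char)');

    my $ins = $dbh->prepare('INSERT INTO b (record, first_char) VALUES (?, ?)');
    $ins->execute($_, lc(substr($_, 0, 1))) for qw(apple banana apricot berry);

    # Fetch only the bucket whose first_char matches the query's.
    my $query  = 'appel';
    my $bucket = $dbh->selectcol_arrayref(
        'SELECT record FROM b WHERE first_char = ? ORDER BY record',
        undef, lc(substr($query, 0, 1)),
    );
    print "@$bucket\n";   # the 'a' bucket: apple apricot

    fuzzy_match would then run over @$bucket only, and other bucketing strategies are just a different expression stored in first_char.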

      actually it's more like an a-z tree ;).



        You are obviously right ;-)