Beefy Boxes and Bandwidth Generously Provided by pair Networks
Do you know where your variables are?

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??
Ok, let's say file A has a series of strings, one per line. Let's say that file B has a series of strings, one per line.

The goal is, for each line in A, to return the best match from B using a subroutine named fuzzy_match, a function that takes two strings and returns a float from 0 to 1.

Now, let's assume that file B is enormous, making the prospect of applying fuzzy_match to each member infeasible. But let's also assume that the first character of each member of B will always be the best result from fuzzy_match for A. This means that instead of looking through all of B, you simply need to retrieve all records from B which start with the same first letter as the current record in A.

Hence you only search a "bucket" of B instead of all of B, saving you a good bit of time.

Now, I could write such an indexing / bucketing routine myself pretty easily, but I'm surprised that such a routine has not already been written. However CPAN showed no results for such a beast... any leads?


One thing to note is that the assumption about the two data sets should be parameterizable. In the description above, the assumption was that the first character of records which would correspond would be the same. But other assumptions are possible.

So, the best interface would be useable under a variety of hashing strategies... so the desired interface would be along the lines of:

my $hash = Data::Bucket->new; $hash->reflect->addSlots( hashing_strategy => sub { my $self = shift; my $string = shift; return substr(0, 2, $string) ; } ); while (<$search_file>) { $hash->store($_); } # then later for (<$in_file>) { my @bucket = $hash->bucket_for($_); my $best_match = fuzzy_match($_, @bucket); .. do something with best match ... }

Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

In reply to optimizing a linear search by Indexed or bucketed hashing by princepawn

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others examining the Monastery: (5)
    As of 2020-07-14 17:44 GMT
    Find Nodes?
      Voting Booth?

      No recent polls found