Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re: Fuzzy Searching: Optimizing Algorithm Selection

by ikegami (Patriarch)
on Nov 11, 2004 at 00:14 UTC ( [id://406855]=note: print w/replies, xml ) Need Help??


in reply to Fuzzy Searching: Optimizing Algorithm Selection

My solution builds a regexp that looks like the following, minus the lines breaks:

^ (?:a|(?(?{ local $fuzzies = $fuzzies; !$fuzzies-- })(?=A)(?=Z)).) (?:p|(?(?{ local $fuzzies = $fuzzies; !$fuzzies-- })(?=A)(?=Z)).) (?:p|(?(?{ local $fuzzies = $fuzzies; !$fuzzies-- })(?=A)(?=Z)).) (?:l|(?(?{ local $fuzzies = $fuzzies; !$fuzzies-- })(?=A)(?=Z)).) (?:e|(?(?{ local $fuzzies = $fuzzies; !$fuzzies-- })(?=A)(?=Z)).) (?{ $fuzzies_left = $fuzzies }) $

Here it is:

our $USE_A_FUZZY = '(?(?{ local $fuzzies = $fuzzies; !$fuzzies-- })(?=A)(?=Z))'; our $COUNT_FUZZIES_LEFT = '(?{ $fuzzies_left = $fuzzies })'; { my $string = 'apple'; my $max_fuzzies = 2; my $re = '^'; $re .= sprintf("(?:%s|${USE_A_FUZZY}.)", quotemeta($1)) while ($string =~ /(.)/g); $re .= "${COUNT_FUZZIES_LEFT}\$"; $re = do { use re 'eval'; qr/$re/ }; our $fuzzies; local $fuzzies; our $fuzzies_left; local $fuzzies_left; while (<DATA>) { chomp; $fuzzies = $max_fuzzies; if (/$re/) { $fuzzies = $max_fuzzies - $fuzzies_left; if ($fuzzies) { print("Fuzzy match with $_ ($fuzzies fuzzies).$/"); } else { print("Exact match with $_.$/"); } } else { print("No match with $_.$/"); } } } __DATA__ apple arple brple brqle orange ------ OUTPUT ------ Exact match with apple. Fuzzy match with arple (1 fuzzies). Fuzzy match with brple (2 fuzzies). No match with brqle. No match with orange.

Don't know if it's faster.

Known Bugs: 'apples' doesn't count as a fuzzy match for 'apple', but it would count as a fuzzy match for 'apple '. I can fix that if you want.

Update: Since we never have any backtracking, we can simplify the above to:

^ (?:a|(?(?{!$fuzzies--})(?=A)(?=Z)).) (?:p|(?(?{!$fuzzies--})(?=A)(?=Z)).) (?:p|(?(?{!$fuzzies--})(?=A)(?=Z)).) (?:l|(?(?{!$fuzzies--})(?=A)(?=Z)).) (?:e|(?(?{!$fuzzies--})(?=A)(?=Z)).) $

As implemented by the following code:

our $USE_A_FUZZY = '(?(?{!$fuzzies--})(?=A)(?=Z))'; { my $string = 'apple'; my $max_fuzzies = 2; my $re = '^'; $re .= sprintf("(?:%s|${USE_A_FUZZY}.)", quotemeta($1)) while ($string =~ /(.)/g); $re .= "\$"; $re = do { use re 'eval'; qr/$re/ }; our $fuzzies; local $fuzzies; while (<DATA>) { chomp; $fuzzies = $max_fuzzies; if (/$re/) { $fuzzies = $max_fuzzies - $fuzzies; if ($fuzzies) { print("Fuzzy match with $_ ($fuzzies fuzzies).$/"); } else { print("Exact match with $_.$/"); } } else { print("No match with $_.$/"); } } }

Replies are listed 'Best First'.
Re^2: Fuzzy Searching: Optimizing Algorithm Selection
by tall_man (Parson) on Nov 11, 2004 at 00:32 UTC
    Since his dictionary strings are variable-length, I don't think you can use begin and end anchors. For example, "I ate an apple." should match, but it doesn't with your code.

      I must have missed that part of the problem description. The first version should work fine without the anchors.

      There is a problem with removing the anchors however (and not just in my solution, I think). It would be possible to find a match with fuzzies while there's an exact match later on in the string.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://406855]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (6)
As of 2024-04-16 17:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found