Finding a _Similar_ Substring? (Fuzzy Searching?)

rjahrman has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Finding a _Similar_ Substring? (Fuzzy Searching?) by duff (Parson) on May 21, 2004 at 02:35 UTC
Try String::Approx.

duff
Re: Finding a _Similar_ Substring? (Fuzzy Searching?) by BUU (Prior) on May 21, 2004 at 02:52 UTC
Actually, reading your requirements, it sounds like a better solution might be to define a list of characters that "don't matter" when you're matching (or doing whatever you want to do). An easy way to do this would be something like:

```perl
my @ignore=(' ','-'); #whatever
for(@ignore){
    s/$_//g;
}
#match against $_
```
Re: Re: Finding a _Similar_ Substring? (Fuzzy Searching?) by hv (Prior) on May 21, 2004 at 03:04 UTC
Since in this type of situation I'd normally expect the one pattern to be matched against many strings, I'd usually approach this by modifying the regexp instead:

```perl
my @ignore=(' ','-'); #whatever
# '*' makes the ignored characters optional between pattern characters
my $ignoreclass = sprintf '[%s]*', join '', map quotemeta, @ignore;
$re = join $ignoreclass, split //, $re;
```

Of course this is only so simple if the initial pattern is a simple string: such modifications are rather more difficult to introduce reliably into a full-on regexp.

Hugo
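A minimal sketch of this pattern-rewriting idea, assuming the search term is a plain literal string and at least one ignorable character is given. The helper name `loose_pattern` and the sample strings are hypothetical, not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex that matches $term even when any
# of the @ignore characters appear between its characters.
# Assumes @ignore is non-empty and holds single characters.
sub loose_pattern {
    my ($term, @ignore) = @_;
    # Character class of ignorable characters, repeated zero or more times.
    my $gap = sprintf '[%s]*', join '', map { quotemeta($_) } @ignore;
    # Escape each literal character, then allow a gap between neighbours.
    my $body = join $gap, map { quotemeta($_) } split //, $term;
    return qr/$body/;
}

my $re = loose_pattern('P100', ' ', '-');

# One compiled pattern, many candidate strings.
for my $candidate ('P100', 'P-100', 'P 1-00', 'P200') {
    printf "%-8s %s\n", $candidate,
        ($candidate =~ $re ? 'matches' : 'no match');
}
```

Compiling the pattern once with `qr//` keeps the per-string cost down to a single match, which is the point of rewriting the pattern rather than every input string.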
Re: Re: Finding a _Similar_ Substring? (Fuzzy Searching?) by TomDLux (Vicar) on May 21, 2004 at 03:14 UTC
If your ignore set is too complicated for a character class, you can OR the entries together into a regex. I doubt it would be necessary here; it is more likely to be needed for sets of words.

```perl
my $ignoreStrings = join '|', map quotemeta, @ignore;
my $deleteThese = qr/$ignoreStrings/;   # qr// does not accept /g
$string =~ s/$deleteThese//g;           # /g goes on the substitution
```

By the way, you're using $_ to represent the various elements of @ignore, but also to denote the default object of s///. That's why I tend to avoid defaults: better to be explicit, self-documenting, and avoid irritating errors.

--
`TTTATCGGTCGTTATATAGATGTTTGCA`
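The alternation approach can be sketched with explicit variables throughout. The helper name `strip_ignored` and the sample ignore strings are hypothetical; sorting the alternatives longest-first is an added precaution so that a longer string is not shadowed by a shorter prefix of itself.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every occurrence of the given
# (possibly multi-character) strings from $text and return the result.
sub strip_ignored {
    my ($text, @ignore) = @_;
    # Longest first, so a long entry wins over a shorter prefix of it.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @ignore;
    my $delete_these = qr/$alternation/;
    (my $copy = $text) =~ s/$delete_these//g;   # work on a copy, not $_
    return $copy;
}

print strip_ignored('P-series 100 (model)', '-series', ' ', '(model)'), "\n";
```

Because the substitution runs on a named copy, there is no confusion between the loop variable and the string being edited.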
Re: Finding a _Similar_ Substring? (Fuzzy Searching?) by BrowserUk (Patriarch) on May 21, 2004 at 03:21 UTC
Depending upon how loose you want the criteria to be, you might get away with something like this.

```perl
my $term = 'P100';

## my $re = qr[@{[ join '\W?', split '', $term ]}];
# Improved slightly.
my $re = qr[@{[ join '\W?', map "\Q$_\E", split '', $term ]}]x;

for( 'P100', 'P-100', 'P 100', 'P1 00',
     'the P 100 is very similar in style to the P-101 & P102.'.
     'The P-100 is a generation behind the P1000'
) {
    print "Matched $1\n" while m[\b($re)\b]g;
}
```

Output:

```
Matched P100
Matched P-100
Matched P 100
Matched P1 00
Matched P 100
Matched P-100
```

You could also add /i if you want case insensitivity.

Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail
Re: Re: Finding a _Similar_ Substring? (Fuzzy Searching?) by rjahrman (Scribe) on May 21, 2004 at 04:03 UTC
What exactly are you doing in the regexes at the top? What's the difference between the first and second one?
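One way to read the two patterns, sketched under assumptions: `@{[ ... ]}` evaluates the expression inside it and interpolates the result into the `qr`, and the second version additionally escapes each character of the term. The term `P.1` and the optional `\W?` separator used here are hypothetical choices made to expose the difference.

```perl
use strict;
use warnings;

my $term = 'P.1';   # hypothetical term containing a regex metacharacter

# @{[ ... ]} runs the join and interpolates the resulting string,
# so each qr below is compiled from a pattern built at run time.
my $unquoted = qr[@{[ join '\W?', split //, $term ]}];
my $quoted   = qr[@{[ join '\W?', map { quotemeta($_) } split //, $term ]}];

for my $s ('P.1', 'PX1') {
    printf "%-4s unquoted=%d quoted=%d\n", $s,
        ($s =~ /^$unquoted$/ ? 1 : 0),
        ($s =~ /^$quoted$/   ? 1 : 0);
}
```

In the unquoted pattern the `.` still means "any character", so `PX1` is accepted; once each character is escaped, only a literal dot will do. For a term made only of letters and digits, such as `P100`, the two patterns behave the same.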
Re: Finding a _Similar_ Substring? (Fuzzy Searching?) by ambrus (Abbot) on May 21, 2004 at 11:47 UTC
If, as others have suggested, you want most characters to be ignored, you could strip all those characters (with y///d) from both the haystack and the needle string, and then perform a match. Also, you may want to use case-insensitive matching.
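A minimal sketch of this strip-then-match idea, assuming spaces and hyphens are the characters to ignore. The helper names `normalize` and `contains_loosely` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case a string and delete spaces and hyphens.
# The ignore set (space, hyphen) is an assumption for illustration.
sub normalize {
    my ($s) = @_;
    (my $n = lc $s) =~ tr/\- //d;
    return $n;
}

# True if $needle occurs in $haystack once both have been normalized.
sub contains_loosely {
    my ($haystack, $needle) = @_;
    return index(normalize($haystack), normalize($needle)) >= 0;
}

print contains_loosely('The p-1 00 arrived', 'P100') ? "found\n" : "not found\n";
```

One caveat with this approach: stripping characters from the haystack also removes word boundaries, so a phrase such as "top 100" would normalize to "top100" and match the needle "P100". Whether that matters depends on how loose the match is meant to be.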

