Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Friend monks,

Sometimes there is a need to find out how similar two strings are. I hereby present one of the most useful and intuitive metrics for string similarity - the Levenshtein distance.

There is a Text::Levenshtein module on CPAN, but it is not well documented. My implementation, on the other hand, explains the algorithm step by step. As they say - "sometimes the best way to understand how the wheel works is to reinvent the wheel".

Take a look, the algorithm is very interesting - a neat hack, I'd say.

#!/usr/local/bin/perl -w use strict; ($#ARGV == 1) or die "Usage: $0 <string1> <string2>\n"; my ($s1, $s2) = (@ARGV); print "The Levenshtein distance between $s1 and $s2 is: " . levenshtei +n($s1, $s2) . "\n"; # Return the Levenshtein distance (also called Edit distance) # between two strings # # The Levenshtein distance (LD) is a measure of similarity between two # strings, denoted here by s1 and s2. The distance is the number of # deletions, insertions or substitutions required to transform s1 into # s2. The greater the distance, the more different the strings are. # # The algorithm employs a promimity matrix, which denotes the # distances between substrings of the two given strings. Read the # embedded comments for more info. If you want a deep understanding # of the algorithm, printthe matrix for some test strings # and study it # # The beauty of this system is that nothing is magical - the distance # is intuitively understandable by humans # # The distance is named after the Russian scientist Vladimir # Levenshtein, who devised the algorithm in 1965 # sub levenshtein { # $s1 and $s2 are the two strings # $len1 and $len2 are their respective lengths # my ($s1, $s2) = @_; my ($len1, $len2) = (length $s1, length $s2); # If one of the strings is empty, the distance is the length # of the other string # return $len2 if ($len1 == 0); return $len1 if ($len2 == 0); my %mat; # Init the distance matrix # # The first row to 0..$len1 # The first column to 0..$len2 # The rest to 0 # # The first row and column are initialized so to denote distance # from the empty string # for (my $i = 0; $i <= $len1; ++$i) { for (my $j = 0; $j <= $len2; ++$j) { $mat{$i}{$j} = 0; $mat{0}{$j} = $j; } $mat{$i}{0} = $i; } # Some char-by-char processing is ahead, so prepare # array of chars from the strings # my @ar1 = split(//, $s1); my @ar2 = split(//, $s2); for (my $i = 1; $i <= $len1; ++$i) { for (my $j = 1; $j <= $len2; ++$j) { # Set the cost to 1 iff the ith char of $s1 # equals the jth of $s2 # # Denotes a substitution cost. When the char are equal # there is no need to substitute, so the cost is 0 # my $cost = ($ar1[$i-1] eq $ar2[$j-1]) ? 0 : 1; # Cell $mat{$i}{$j} equals the minimum of: # # - The cell immediately above plus 1 # - The cell immediately to the left plus 1 # - The cell diagonally above and to the left + the cost # # We can either insert a new char, delete a char of # substitute an existing char (with an associated cost) # $mat{$i}{$j} = min([$mat{$i-1}{$j} + 1, $mat{$i}{$j-1} + 1, $mat{$i-1}{$j-1} + $cost]); } } # Finally, the distance equals the rightmost bottom cell # of the matrix # # Note that $mat{$x}{$y} denotes the distance between the # substrings 1..$x and 1..$y # return $mat{$len1}{$len2}; } # minimal element of a list # sub min { my @list = @{$_[0]}; my $min = $list[0]; foreach my $i (@list) { $min = $i if ($i < $min); } return $min; }

In reply to Levenshtein distance: calculating similarity of strings by spurperl

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (4)
As of 2024-04-19 13:29 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found