Levenshtein distance: calculating similarity of strings

Friend monks,

Sometimes there is a need to find out how similar two strings are. I hereby present one of the most useful and intuitive metrics for string similarity - the Levenshtein distance.

There is a Text::Levenshtein module on CPAN, but it is not well documented. My implementation, on the other hand, explains the algorithm step by step. As they say - "sometimes the best way to understand how the wheel works is to reinvent the wheel".

Take a look, the algorithm is very interesting - a neat hack, I'd say.

#!/usr/local/bin/perl -w

use strict;

($#ARGV == 1) or die "Usage: $0 <string1> <string2>\n";

my ($s1, $s2) = (@ARGV);

print "The Levenshtein distance between $s1 and $s2 is: " . levenshtei
+n($s1, $s2) . "\n";



# Return the Levenshtein distance (also called Edit distance) 
# between two strings
#
# The Levenshtein distance (LD) is a measure of similarity between two
# strings, denoted here by s1 and s2. The distance is the number of
# deletions, insertions or substitutions required to transform s1 into
# s2. The greater the distance, the more different the strings are.
#
# The algorithm employs a promimity matrix, which denotes the 
# distances between substrings of the two given strings. Read the 
# embedded comments for more info. If you want a deep understanding 
# of the algorithm, printthe matrix for some test strings 
# and study it
#
# The beauty of this system is that nothing is magical - the distance
# is intuitively understandable by humans
#
# The distance is named after the Russian scientist Vladimir
# Levenshtein, who devised the algorithm in 1965
#
sub levenshtein
{
    # $s1 and $s2 are the two strings
    # $len1 and $len2 are their respective lengths
    #
    my ($s1, $s2) = @_;
    my ($len1, $len2) = (length $s1, length $s2);

    # If one of the strings is empty, the distance is the length
    # of the other string
    #
    return $len2 if ($len1 == 0);
    return $len1 if ($len2 == 0);

    my %mat;

    # Init the distance matrix
    #
    # The first row to 0..$len1
    # The first column to 0..$len2
    # The rest to 0
    #
    # The first row and column are initialized so to denote distance
    # from the empty string
    #
    for (my $i = 0; $i <= $len1; ++$i)
    {
        for (my $j = 0; $j <= $len2; ++$j)
        {
            $mat{$i}{$j} = 0;
            $mat{0}{$j} = $j;
        }

        $mat{$i}{0} = $i;
    }

    # Some char-by-char processing is ahead, so prepare
    # array of chars from the strings
    #
    my @ar1 = split(//, $s1);
    my @ar2 = split(//, $s2);

    for (my $i = 1; $i <= $len1; ++$i)
    {
        for (my $j = 1; $j <= $len2; ++$j)
        {
            # Set the cost to 1 iff the ith char of $s1
            # equals the jth of $s2
            # 
            # Denotes a substitution cost. When the char are equal
            # there is no need to substitute, so the cost is 0
            #
            my $cost = ($ar1[$i-1] eq $ar2[$j-1]) ? 0 : 1;

            # Cell $mat{$i}{$j} equals the minimum of:
            #
            # - The cell immediately above plus 1
            # - The cell immediately to the left plus 1
            # - The cell diagonally above and to the left + the cost
            #
            # We can either insert a new char, delete a char of
            # substitute an existing char (with an associated cost)
            #
            $mat{$i}{$j} = min([$mat{$i-1}{$j} + 1,
                                $mat{$i}{$j-1} + 1,
                                $mat{$i-1}{$j-1} + $cost]);
        }
    }

    # Finally, the distance equals the rightmost bottom cell
    # of the matrix
    #
    # Note that $mat{$x}{$y} denotes the distance between the 
    # substrings 1..$x and 1..$y
    #
    return $mat{$len1}{$len2};
}


# minimal element of a list
#
sub min
{
    my @list = @{$_[0]};
    my $min = $list[0];

    foreach my $i (@list)
    {
        $min = $i if ($i < $min);
    }

    return $min;
}
[download]

Comment on Levenshtein distance: calculating similarity of strings Download Code

Replies are listed 'Best First'.

Re: Levenshtein distance: calculating similarity of strings
by omnibus (Scribe) on Mar 24, 2003 at 16:34 UTC

here

[reply]

Re: Levenshtein distance: calculating similarity of strings
by PodMaster (Abbot) on Mar 24, 2003 at 18:11 UTC

There is a Text::Levenshtein module on CPAN, but it totally lacks documentation (what else is new !).

http://search.cpan.org/author/DREE/Text-Levenshtein-0.03/Levenshtein.pm

As for how the algorithm works, Text::Lavenshtein links to http://www.merriampark.com/ld.htm#FLAVORS which explains it rather well.

MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
I run a Win32 PPM repository for perl 5.6x+5.8x. I take requests.
** The Third rule of perl club is a statement of fact: pod is sexy.

[reply]

Re: Re: Levenshtein distance: calculating similarity of strings

by spurperl (Priest) on Mar 25, 2003 at 06:27 UTC

When I said "what else is new" I didn't refer to CPAN modules, but the general practice of not commenting. I see how I could've been misunderstood
http://www.merriampark.com/ld.htm#FLAVORS indeed gives a step-by-step recipe to the algorithm, but if you read carefully you see that it doesn't go a bit into explaining why things are the way they are. My code does
To avoid further mis-understanding, I removed the offending phrase

[reply]

Re: Levenshtein distance: calculating similarity of strings
by hossman (Prior) on Mar 25, 2003 at 01:57 UTC

In addition to Text::Levenshtein (whose docs look short, but sufficient to me) Jarkko has a great module that does the same thing, but with a lot more fine control: String::Approx.

It provides an LRU cache for what he calls the "matching engines" of a pattern (so that testing the distance between $string1 and $string3 doesn't incur a bunch of overhead you allready spent on calculating hte distance between $string1 and $string2), it provides lots of usefull methods that utilize the Levenshtein Distance to do splicing, slicing, or finding the location of substring ... and he provides modifiers that let you say things like "allow only inserts or substitutions of characters, but no deleters" or tell me if these strings are 90% similar"

[reply]

Re: Levenshtein distance: calculating similarity of strings
by goldbug (Initiate) on Mar 27, 2003 at 16:55 UTC

String::Trigram

[reply]

Re^2: Levenshtein distance: calculating similarity of strings

by Anonymous Monk on Jul 07, 2014 at 13:30 UTC

hello everyone,

I am trying to get the separate results from the Levenshtein distance, namely the number of deleted, inserted and replaced characters. Do you have any idea?

Thanks in advance.

[reply]

Re^3: Levenshtein distance: calculating similarity of strings

by Anonymous Monk on Jul 08, 2014 at 02:18 UTC

Hi :) You're asking goldbug was last here 11 years ago :) and can't hear you. See on necromancy (reviving old threads) and The Perl Monks Guide to the Monastery

If you want to ask everybody go to new question: Seekers of Perl Wisdom

OTOH, going to cpan is always a good idea, Levenshtein deleted, Levenshtein deleted

[reply]

Re^4: Levenshtein distance: calculating similarity of strings

by Anonymous Monk on Apr 17, 2017 at 16:53 UTC

Re: Levenshtein distance: calculating similarity of strings
by Anonymous Monk on Feb 21, 2018 at 22:04 UTC

This code uses a hash instead of a simple array for mat. I changed it to a regular array, and it reduced the execution speed by 19.3%.

[reply]

Re^2: Levenshtein distance: calculating similarity of strings

by AnomalousMonk (Archbishop) on Feb 21, 2018 at 22:31 UTC

... it reduced the execution speed ...

So just to be clear, are you saying your change made the code slower by about 20%?

Give a man a fish: <%-{-{-{-<

[reply]
[d/l]

Re^3: Levenshtein distance: calculating similarity of strings

by Anonymous Monk on Feb 22, 2018 at 04:27 UTC

Oops, sorry. It reduced execution time by about 20%. I'm not allowed to post my actual code, but it was just the few obvious changes: declare @mat rather than %mat, and use [] rather than {}.

I made a loop that called levenshtein with 2 arrays of strings of 2-character words. Between one and 25 words per string, comparing all possible combinations of 51 different such "sentences". Hence 51 * 51 = 2061 string comparisons of between one and 25 words per sentence. I executed it on an old MacBook Pro 200 times, and took the average "user" time from the "time" command.

Average "user" time was 19.31% shorter.

[reply]

Re^4: Levenshtein distance: calculating similarity of strings

by AnomalousMonk (Archbishop) on Feb 22, 2018 at 04:53 UTC


Pathologically Eclectic Rubbish Lister
	PerlMonks