Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

measuring the similarity between strings

by mayaTheCat (Scribe)
on Sep 05, 2003 at 10:10 UTC ( #289152=snippet: print w/replies, xml ) Need Help??
Description: hi monks, here comes another implementation of the Levenshtein Distance; a distance metric which measures the similarities of strings.

Info and some other implementations can be found at Levenshtein distance: calculating similarity of strings.

At the moment I need to compare two sets of strings where each set consists of approximately 2000 strings, which means approximataley 3-4 million measurements. Thus I wrote this code, in which I tried to minimize the memory usage, to make it more efficient.

sub ldist {
    my @s = split //, shift;
    my @t = split //, shift;

    return scalar @t if scalar @s == 0;
    return scalar @s if scalar @t == 0;

    my (@prevColumn, @currColumn);

    @prevColumn = 0..scalar(@t);

    for my $s (0..$#s) {
        @currColumn = ( $s + 1 );
        for my $t (0..$#t) {
            push @currColumn, min(
                $currColumn[$t] + 1
                , $prevColumn[$t+1] + 1
                , $prevColumn[$t] + ($s[$s] eq $t[$t] ? 0 : 1)
        @prevColumn = @currColumn;

    pop @currColumn

sub min {
    my $min = shift;

    for (@_) {
        $min = $_ if $_ < $min;

Replies are listed 'Best First'.
Re: measuring the similarity between strings
by dree (Monsignor) on Sep 05, 2003 at 12:12 UTC
    To make more efficient you can also try Text::LevenshteinXS that is 1000% faster than plain Text::Levenshtein.
Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: snippet [id://289152]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (5)
As of 2021-12-01 15:45 GMT
Find Nodes?
    Voting Booth?
    R or B?

    Results (13 votes). Check out past polls.