go ahead... be a heretic PerlMonks

### Re: diff of two strings

by lima1 (Curate)
 on Jan 07, 2008 at 19:36 UTC ( #660914=note: print w/replies, xml ) Need Help??

in reply to diff of two strings

You want the Longest common subsequence, not substring. Below a simple modification of the wikipedia pseudocode of the dynamic programming algorithm (http://en.wikipedia.org/wiki/Longest_common_subsequence_problem) to make it word based.
```use strict;
use warnings;

use Data::Dumper;

# word lists, first element a dummy
my @s1 = (q{}, split(/s+/, "Perlmonks is the best perl community"));
my @s2 = (q{}, split(/s+/, "Perlmonks is one of the best community of
+perl users"));

my @M;

#init dyn. prog. matrix
for my \$i ( 0 .. \$#s1) {
\$M[\$i][0] = 0;
}

for my \$i ( 0 .. \$#s2) {
\$M[0][\$i] = 0;
}

#calc lcs (word based)
for my \$i ( 1 .. \$#s1) {
for my \$j ( 1 .. \$#s2) {
if (\$s1[\$i] eq \$s2[\$j]) {
\$M[\$i][\$j] = \$M[\$i-1][\$j-1]+1;
}
else {
if (\$M[\$i][\$j-1] > \$M[\$i-1][\$j]) {
\$M[\$i][\$j] = \$M[\$i][\$j-1];
}
else {
\$M[\$i][\$j] = \$M[\$i-1][\$j];
}
}
}
}

#print Dumper \@M;
printDiff(\$#s1, \$#s2);

sub printDiff {
my (\$i, \$j) = @_;
if (\$i > 0 && \$j > 0 and \$s1[\$i] eq \$s2[\$j]) {
printDiff(\$i-1, \$j-1);
print "  " . \$s1[\$i];
}
else {
if (\$j > 0 && (\$i == 0 || \$M[\$i][\$j-1] >= \$M[\$i-1][\$j]
+)) {
printDiff(\$i, \$j-1);
print " <" . \$s2[\$j] . ">";
}
elsif (\$i > 0 && (\$j == 0 || \$M[\$i][\$j-1] < \$M[\$i-1][\$
+j])) {
printDiff(\$i-1, \$j);
print " [" . \$s1[\$i] . "]";
}
}
}
This outputs:
``` Perlmonks  is <one> <of>  the  best [perl]  community <of> <perl> <us
+ers>

Replies are listed 'Best First'.
Re^2: diff of two strings
by flaviusm (Acolyte) on Jan 07, 2008 at 20:55 UTC

lima1,

Thank you very much for taking time to modify wikipedia's pseudocode, but the output generated by your script doesn't seem to match the output presented in the original posting of the problem.

Problem:
- only one string displayed (instead of two)

Can the above algorithm be modified to my match my proposed formating?

Thank you.

Sure. It is still not very polished but the algorithm should be quite optimal..Note that the brackets here are insertions and deletions, not new or common words! If you want this, I would mark these words in a postprocessing step. However, I think this information is not as important as insertions/deletions. They tell you exactly how to transform one string into the other. So I would use different colors for unique or common words.
```my @output;
printDiff(\$#s1, \$#s2);
print join "\n", @output;

sub printDiff {
my (\$i, \$j) = @_;
if (\$i > 0 && \$j > 0 and \$s1[\$i] eq \$s2[\$j]) {
printDiff(\$i-1, \$j-1);
\$output[0] .= "  " . \$s1[\$i];
\$output[1] .= "  " . \$s1[\$i];
}
else {
if (\$j > 0 && (\$i == 0 || \$M[\$i][\$j-1] >= \$M[\$i-1][\$j]
+)) {
printDiff(\$i, \$j-1);
\$output[1] .= " <" . \$s2[\$j] . ">";
}
elsif (\$i > 0 && (\$j == 0 || \$M[\$i][\$j-1] < \$M[\$i-1][\$
+j])) {
printDiff(\$i-1, \$j);
\$output[0] .= " [" . \$s1[\$i] . "]";
}
}
}
```  Perlmonks  is  the  best [perl]  community
Perlmonks  is <one> <of>  the  best  community <of> <perl> <users>

lima1,

The presented algorithm is great but as I mentioned before it does not fit the proposed problem. I agree that for many situations insertions/deletions are very important, but for this specific application they are useless.

The output syntax presented in the first post is not flexible. The two strings should be marked only based on unchanged/new/moved status of each word.

I am afraid that the solution is beyond the simple use of LCS.

Note: In my first version of my application I used LCS (subsequence) but it failed for some cases where some sequences appeared multiple times and were part of other subsequences as well.

P.S. I appreciate very much your help and time and I try to keep you close to the proposed requirements so that your time and effort will not be useless for my problem.

Create A New User
Node Status?
node history
Node Type: note [id://660914]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (3)
As of 2021-04-18 08:03 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?

No recent polls found

Notices?