Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

diff of two strings

by flaviusm (Acolyte)
on Jan 07, 2008 at 15:45 UTC ( [id://660854]=perlquestion: print w/replies, xml ) Need Help??

flaviusm has asked for the wisdom of the Perl Monks concerning the following question:

I am working on a program to highlight the changes between two manuscript collections. So, I am searching for the quickest algorithm/solution to mark the difference between two strings.

The differences should be marked following the rules:

- the difference of the two strings should be word based (not character based)
- new words should be marked between "<" and ">" (e.g. "<new_word>")
- common words that changed their position in the string should be placed between "[" and "]" (e.g. "[changed_place]")
- common words that kept their position in the string should be just copied to output

e.g.

--- original strings ---

Perlmonks is the best perl community
Perlmonks is one of the best community of perl users

--- marked strings ---

Perlmonks is the best [perl] [community]
Perlmonks is <one of> the best [community] <of> [perl] <users>

Current approach:
I currently use the LCSS dynamic algorithm to mark the longest common substring. I compare then the position of the LCSS within the two strings. If the position changed, I mark the substring with "[]", otherwise I leave it unmarked. I do the same for all common substrings. The substring(s) left after all LCSS operations, are considered new and are marked with "<>";

The algorithm is very slow and I have difficulties finding all common substrings between the two strings.

I would appreciate if you can guide me to a different solution, module, algorithm etc.

Thank you.

Replies are listed 'Best First'.
Re: diff of two strings
by rgiskard (Hermit) on Jan 07, 2008 at 19:14 UTC
    In the spirit of finding a solution (versus an algorithm), have you reviewed CPAN's Text::WordDiff? It appears capable of doing what you'd like. In the authors words:
    To diff source code, one is still best off using 
    Text::Diff. But if you want to see how a short story 
    changed from one version to the next, this module will
     do the job very nicely.

      rgiskard,

      I did not know about Text::WordDiff module, but I just installed it and tried it.

      The module does a great job identifying the differences of the two strings. However, the module seems to be lacking some functionality in regard to my specific problem

      Problems:
      - the result generated by the module is a single string color coded of the combined two strings => I need two strings (original strings) but with new/changed words tagged (see example in original posting).
      - I did not find a functionality in the module to give me an indication if a certain word is new or it just changed its position in the string.

      However, I will take a look in the module's source code, I might get an idea about their algorithm

      Thanks a lot

Re: diff of two strings
by lima1 (Curate) on Jan 07, 2008 at 19:36 UTC
    You want the Longest common subsequence, not substring. Below a simple modification of the wikipedia pseudocode of the dynamic programming algorithm (http://en.wikipedia.org/wiki/Longest_common_subsequence_problem) to make it word based.
    use strict; use warnings; use Data::Dumper; # word lists, first element a dummy my @s1 = (q{}, split(/s+/, "Perlmonks is the best perl community")); my @s2 = (q{}, split(/s+/, "Perlmonks is one of the best community of +perl users")); my @M; #init dyn. prog. matrix for my $i ( 0 .. $#s1) { $M[$i][0] = 0; } for my $i ( 0 .. $#s2) { $M[0][$i] = 0; } #calc lcs (word based) for my $i ( 1 .. $#s1) { for my $j ( 1 .. $#s2) { if ($s1[$i] eq $s2[$j]) { $M[$i][$j] = $M[$i-1][$j-1]+1; } else { if ($M[$i][$j-1] > $M[$i-1][$j]) { $M[$i][$j] = $M[$i][$j-1]; } else { $M[$i][$j] = $M[$i-1][$j]; } } } } #print Dumper \@M; printDiff($#s1, $#s2); sub printDiff { my ($i, $j) = @_; if ($i > 0 && $j > 0 and $s1[$i] eq $s2[$j]) { printDiff($i-1, $j-1); print " " . $s1[$i]; } else { if ($j > 0 && ($i == 0 || $M[$i][$j-1] >= $M[$i-1][$j] +)) { printDiff($i, $j-1); print " <" . $s2[$j] . ">"; } elsif ($i > 0 && ($j == 0 || $M[$i][$j-1] < $M[$i-1][$ +j])) { printDiff($i-1, $j); print " [" . $s1[$i] . "]"; } } }
    This outputs:
    Perlmonks is <one> <of> the best [perl] community <of> <perl> <us +ers>

      lima1,

      Thank you very much for taking time to modify wikipedia's pseudocode, but the output generated by your script doesn't seem to match the output presented in the original posting of the problem.

      Problem:
      - only one string displayed (instead of two)

      Can the above algorithm be modified to my match my proposed formating?

      Thank you.

        Sure. It is still not very polished but the algorithm should be quite optimal..Note that the brackets here are insertions and deletions, not new or common words! If you want this, I would mark these words in a postprocessing step. However, I think this information is not as important as insertions/deletions. They tell you exactly how to transform one string into the other. So I would use different colors for unique or common words.
        my @output; printDiff($#s1, $#s2); print join "\n", @output; sub printDiff { my ($i, $j) = @_; if ($i > 0 && $j > 0 and $s1[$i] eq $s2[$j]) { printDiff($i-1, $j-1); $output[0] .= " " . $s1[$i]; $output[1] .= " " . $s1[$i]; } else { if ($j > 0 && ($i == 0 || $M[$i][$j-1] >= $M[$i-1][$j] +)) { printDiff($i, $j-1); $output[1] .= " <" . $s2[$j] . ">"; } elsif ($i > 0 && ($j == 0 || $M[$i][$j-1] < $M[$i-1][$ +j])) { printDiff($i-1, $j); $output[0] .= " [" . $s1[$i] . "]"; } } }
        Perlmonks is the best [perl] community Perlmonks is <one> <of> the best community <of> <perl> <users>
Re: diff of two strings
by chaos_cat (Scribe) on Jan 07, 2008 at 22:01 UTC
    Interesting problem, I had fun playing with this one. Thanks for posting it. I'll say in advance, my results do not match the specified results 100%, specifically my code marks the words "the", "best", and "perl" as moved, since their absolute positions in the two strings is different. If you could elaborate on why those words should not be marked as moved, I can adjust the algorithm when I get home this evening.
    use strict; use warnings; use Data::Dumper; my $str1 = 'Perlmonks is the best perl community'; my $str2 = 'Perlmonks is one of the best community of perl users'; if ($str1 eq $str2) { print $str1; exit; } my @wl1 = split /\s+/, $str1; my @wl2 = split /\s+/, $str2; my $wp1 = build_word_hash(\@wl1); my $wp2 = build_word_hash(\@wl2); my $diff_str1 = ''; my $diff_str2 = ''; while (@wl1 || @wl2) { my $word1 = shift @wl1; my $word2 = shift @wl2; if ($word1 && $word2 && $word1 eq $word2) { $diff_str1 .= $word1 . ' '; # pairing the word from the origio +nal string with it's output $diff_str2 .= $word2 . ' '; # lets us do things like case inse +nsitive, but preserving match later shift @{$wp1->{$word1}}; # eat this word shift @{$wp2->{$word2}}; # eat this word next; } #process word1 first, for fun if ($word1) { if ($wp2->{$word1} && @{$wp2->{$word1}} && ! grep {$_ == $wp2->{$word1}->[0]} @{$wp1->{$word1}} ) +{ # word moved. # the grep checks that the next occurance of the word in s +tring 2 ($wp2->{$word}->[0] # does not also have an occurance of the word in string 1. # if it does not, it means that this is a move of the word +. $diff_str1 .= "[$word1] "; shift @{$wp2->{$word1}}; # eat this word } else { # Easy case, word in string 1 but not string 2 $diff_str1 .= "<$word1> "; } } if ($word2) { if ($wp1->{$word2} && @{$wp1->{$word2}} && ! grep {$_ == $wp1 +->{$word2}->[0]} @{$wp2->{$word2}} ) { $diff_str2 .= "[$word2] "; shift @{$wp1->{$word2}}; # eat this word } else { $diff_str2 .= "<$word2> "; } } } print "$diff_str1\n$diff_str2\n"; sub build_word_hash { my $wl = shift; my $res = {}; my $i = 0; foreach my $word( @$wl ) { push @{$res->{$word}} , $i++; } return $res; }
    results:
    Perlmonks is [the] [best] [perl] [community] Perlmonks is <one> <of> [the] [best] [community] <of> [perl] <users>
    Sorry for sloppy code, I was writeing this on my lunch break.

      chaos_cat,

      Thank you very much for taking time to "play" with my problem and also for the patience with my incomplete requirements.

      Some clarifications:
      - the new words will not modify the position in the string of the words that follow (this is true for the original string as well as for the modified string).

      e.g.
      ----

      original: Perlmonks is ..................... the best [perl] ...................[community]
      modified: Perlmonks is <one> <of> the best [community] <of> [perl] <users>

      Note: "the" is the 3rd word in the original and it is considered to be the 3rd word in the modified string also because "<one>" and "<of>" are new words and should be ignored (will not increase the position value)

      while

      original: ..... the best [perl] ....................[community]
      modified: ... the best [community] <of> [perl] <users>

      Note: "perl" and "community" should be marked as changed/moved, because they changed the position in the sentence.

      words ranks:
      ------------
      original: ......1.......2....................3....4.....[5].............[6]...........
      modified: ....1.......2..<7>..<8>...3....4.....[6]...<9>...[5]......<10>.

      Please let me know if you don't understant what I tried to explain above and I will explain it in different words.

      Thanks a lot.

        Ah, ok, that makes sense. Actually working with that idea has enabled me to clean the code up a lot. The code should run in O(N) time, with N being the sum of the word counts of the strings. It makes two passes over the strings, once to build the word count hashes and once to do the actual diff.
        use strict; use warnings; use Data::Dumper; my $str1 = 'Perlmonks is the best perl community'; my $str2 = 'Perlmonks is one of the best community of perl users'; if ( $str1 eq $str2 ) { print "$str1\n$str2\n"; exit; } my @wl1 = split /\s+/, $str1; my @wl2 = split /\s+/, $str2; my $diff_str1 = ''; my $diff_str2 = ''; my %wc1; my %wc2; foreach my $word (@wl1) { $wc1{$word}++; } foreach my $word (@wl2) { $wc2{$word}++; } while (@wl1 || @wl2) { my $word1 = ''; my $word2 = ''; # being sloppy and not decrementing word counts for new words # since we only use them for moves while (!$wc2{$word1}) { $diff_str1 .= "<$word1> " if $word1; $wc1{$word1}-- if $word1; $word1 = ''; # prevent fall through if this is the last word last if !@wl1; $word1 = shift @wl1; } while (!$wc1{$word2}) { $diff_str2 .= "<$word2> " if $word2; $wc2{$word2}-- if $word2; $word2 = ''; last if !@wl2; $word2 = shift @wl2; } if ( $word1 && $word2 && $word1 eq $word2 ) { $diff_str1 .= $word1 . ' '; # pairing the word from the ori +gional string with it's output $diff_str2 .= $word2 . ' '; # lets us do things like case i +nsensitive, but preserving match later } else { $diff_str1 .= "[$word1] " if $word1; $diff_str2 .= "[$word2] " if $word2; } $wc1{$word2}--; $wc2{$word1}--; } print "$diff_str1\n$diff_str2\n";
        output:
        Perlmonks is the best [perl] [community] Perlmonks is <one> <of> the best [community] <of> [perl] <users>
        The only weird case i found is if a word occurs in one string more than in the other, and they occur in the same position but not first. For example:
        $str1 = 'Perlmonks is the best perl perl community'; $str2 = 'Perlmonks is one of the best community of perl users'; --------- Perlmonks is the best [perl] <perl> [community] Perlmonks is <one> <of> the best [community] <of> [perl] <users>
        I'm not sure if that's correct or not by your standard. My first algorithm had accounted for this with the position hash, by looking ahead to see if a later occurrence of the word was in the same absolute position in the string. I did away with that since absolute position isn't in fact what you were interested in, but if you need to account for this case differently, something like that could be done, with an offset added to account for how the new words change the positions in the string.

        I'm somewhat curious what real world problem you're trying to solve with this. Depending on what you're doing, you might be able to get better results from a different methodology. For example, if you're trying to build a plagiarism detector, you might want to look into some kind of document similarity type algorithm.
        original: Perlmonks is ...... the best perl community modified: Perlmonks is one of the best .... community of perl users
        this is exactly what the LCS does. Everything else are trivial output/postprocessing rules. Not?
Re: diff of two strings
by starX (Chaplain) on Jan 07, 2008 at 16:10 UTC
    The Wikimedia engine does something like this. The way they mark changes is a little bit different, but you might try taking a look at the source and see if it gives you any clues. I'm anticipating a similar project in my not too distant future, so if you do come up with a good solution, please share :)
Re: diff of two strings
by sfink (Deacon) on Jan 09, 2008 at 08:29 UTC
    As described, I don't think there is a unique solution. Do you just want any valid solution, or one with some sort of minimum cost, or what?

    Consider (I'll use single letters to stand in for words): ABA vs BAA: <A>BA,BA<A>? Or [A][B]A,[B][A]A?

    I'm guessing you don't care too much, as long as there isn't an easy way to remove markup from the result (i.e., it's ok to return a local minimum even if it isn't the global minimum).

    For your algorithm above, I think what you really want is a suffix tree (of words) instead of the dynamic programming algorithm, but beyond that, my brain is too sleepy to be of much use. You'd find the LCSS, remove it from the suffix tree in some sense, and then I think you'd be able to immediately iterate to find the remaining LCSSes until you run out. I think the removal and all other operations should be pretty quick.

    Without a suffix tree... how slow would it be to run your existing LCSS algorithm, then replace the found LCSS with a different dummy token in each string and repeat? Or is that what you're already doing?

Re: diff of two strings
by flaviusm (Acolyte) on Jan 10, 2008 at 15:29 UTC

    chaos_cat

    I tried your program against one corpus of documents and it works great for many sentences. However, I detected a logical problem.

    considering a sentence of numbers, I get the following result (which is syntacticaly correct, but logicaly incorrect):

    01 <02> [03] [04] [03] [05] [06] [07] [08] [09] [10] [11] <03> 12 13 14 15 12 16
    01 ......... [04] [03] [05] [06] [07] [08] [09] [10] [11] [03] ........ 12 13 14 15 12 16

    instead of:

    01 <02> <03> 04 03 05 06 07 08 09 10 11 03 12 13 14 15 12 16
    01 .................. 04 03 05 06 07 08 09 10 11 03 12 13 14 15 12 16

    Is there a way to go around that problem? Thanks.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://660854]
Approved by McDarren
Front-paged by planetscape
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (6)
As of 2024-04-23 18:07 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found