http://qs321.pair.com?node_id=660970


in reply to Re^2: diff of two strings
in thread diff of two strings

Sure. It is still not very polished but the algorithm should be quite optimal..Note that the brackets here are insertions and deletions, not new or common words! If you want this, I would mark these words in a postprocessing step. However, I think this information is not as important as insertions/deletions. They tell you exactly how to transform one string into the other. So I would use different colors for unique or common words.
my @output; printDiff($#s1, $#s2); print join "\n", @output; sub printDiff { my ($i, $j) = @_; if ($i > 0 && $j > 0 and $s1[$i] eq $s2[$j]) { printDiff($i-1, $j-1); $output[0] .= " " . $s1[$i]; $output[1] .= " " . $s1[$i]; } else { if ($j > 0 && ($i == 0 || $M[$i][$j-1] >= $M[$i-1][$j] +)) { printDiff($i, $j-1); $output[1] .= " <" . $s2[$j] . ">"; } elsif ($i > 0 && ($j == 0 || $M[$i][$j-1] < $M[$i-1][$ +j])) { printDiff($i-1, $j); $output[0] .= " [" . $s1[$i] . "]"; } } }
Perlmonks is the best [perl] community Perlmonks is <one> <of> the best community <of> <perl> <users>

Replies are listed 'Best First'.
Re^4: diff of two strings
by flaviusm (Acolyte) on Jan 09, 2008 at 05:17 UTC

    lima1,

    The presented algorithm is great but as I mentioned before it does not fit the proposed problem. I agree that for many situations insertions/deletions are very important, but for this specific application they are useless.

    The output syntax presented in the first post is not flexible. The two strings should be marked only based on unchanged/new/moved status of each word.

    I am afraid that the solution is beyond the simple use of LCS.

    Note: In my first version of my application I used LCS (subsequence) but it failed for some cases where some sequences appeared multiple times and were part of other subsequences as well.

    P.S. I appreciate very much your help and time and I try to keep you close to the proposed requirements so that your time and effort will not be useless for my problem.

      flaviusm,

      your problem is in principle NP-hard. However, the hardest part of the problem is to determine whether a word is "unmoved" (these are the unmarked ones in my output, see also Re^3: diff of two strings). This can be done in O(nm) with this dynamic programming ansatz. How the words are finally printed out in printDiff based for example on uniqueness (insertion, but only in the second string) or moved position (in both strings, but a insertion/deletion) are IMHO trivial changes.

      I do not understand your note and I still think the LCS is the way to go. Do you understand the algorithm? If yes, then please tell my why or when the LCS fails.

      If you really want to use common substrings to construct a skeleton, try Tree::Suffix, which is also mentioned in my String::LCSS_XS module. But keep in mind that these modules are character based (so they are much slower without any re-encoding tricks) and that the free suffix tree/array implementations are in general not as good as they could be. Optimizing the lcs algorithm in C is really easy. See the wikipedia article for details. With the linear memory modification, the data should even fit in the cache for not too large strings.