
Re: Comparing two text files and marking differences

by tybalt89 (Monsignor)
on Jan 30, 2021 at 17:34 UTC


in reply to Comparing two text files and marking differences

Here are your examples run through a color word-differ I had lying around. Maybe this could be a starting point for solving your problem.

#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11127688
use warnings;
use Algorithm::Diff qw(traverse_sequences);
use Term::ANSIColor;

while( <DATA> )
  {
  my @from = split /(\s+)/;
  my @to = split /(\s+)/, <DATA>;
  traverse_sequences( \@from, \@to,
    {
    MATCH => sub {print $from[shift()]},
    DISCARD_A => sub {print color('red'), $from[shift()], color 'reset'},
    DISCARD_B => sub {print color('green'), $to[pop()], color 'reset'},
    } );
  print "\n\n";
  }

__DATA__
A few moments will suffice to commit it to memory; yet the period which it covers, commencing more than twenty-five centuries ago, reaches on from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, over into the eternal state.
A few moments will suffice to commit it to memory, yet the period which it covers, beginning more than twenty-five centuries ago, reaches from that far-distant point past the rise and fall of kingdoms, past the setting up and overthrow of empires, past cycles and ages, past our own day, to the eternal state.
Now opens one of the sublimest chapters of human history.
Now opens one of the most comprehensive of the histories of world empires.
With what interest, as well as astonishment, must the king have listened, as he was informed by the prophet that he, or rather his kingdom, the king being here put for his kingdom (see the following verse), was the golden head of the magnificent image which he had seen.
With what interest and astonishment must the king have listened as he was informed by the prophet that his kingdom was the golden head of the magnificent image.
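The `split /(\s+)/` in the script above depends on the capture group: with it, the whitespace separators come back as list elements alongside the words, which is what lets the callbacks print the text with its original spacing. Here is a minimal sketch of that behaviour in core Perl; the sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line; any text with irregular spacing will do.
my $line = "reaches  on from\tthat point";

# Without a capture group, split discards the separators.
my @words = split /\s+/, $line;

# With a capture group, the separators are returned as list elements too,
# so words and runs of whitespace alternate in the result.
my @tokens = split /(\s+)/, $line;

# Joining the tokens reproduces the input exactly.
my $rebuilt = join '', @tokens;

print scalar(@words), " words, ", scalar(@tokens), " tokens\n";
print $rebuilt eq $line ? "round trip ok\n" : "round trip failed\n";
```

Keeping the separators as tokens means a change in spacing alone would also show up as a difference, which may or may not be wanted.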

Re^2: Comparing two text files and marking differences
by Polyglot (Chaplain) on Jan 31, 2021 at 07:20 UTC

    I appreciate all of the answers, and found that I was able to make some modifications to this one in particular which seems to be yielding most of what I want. I'm still doing a few post-subroutine substitutions to clear up some text formatting issues, but the following subroutine does the bulk of what needed to be done.

    sub comparator {

    my $str1 = shift @_;
    my $str2 = shift @_;
    my $original = '';
    my $revised = '';

    my @from = split(/((?:<[^>]+>)+|(?:\s)+|(?:\w[A-Za-z'-]*\w*)+|(?:\W|\P{IsWord})|(?:\p{IsDigit}))/, $str1);
    my @to = split(/((?:<[^>]+>)+|(?:\s)+|(?:\w[A-Za-z'-]*\w*)+|(?:\W|\P{IsWord})|(?:\p{IsDigit}))/, $str2);

    my $OS = qq|<span class="m">|;
    my $OE = qq|</span> |;
    my $RS = qq|<span class="hl">|;
    my $RE = qq|</span> |;

    traverse_sequences( \@from, \@to,
    {
    MATCH => sub { my $oldtext = $from[shift()]; $original .= $oldtext; $revised .= $oldtext },
    DISCARD_A => sub { my $oldtext = $from[shift()]; if ($oldtext =~ m/(?:\p{IsPunct})|(?:\s)/) {$original .= $oldtext } else { $original .= $OS.$oldtext.$OE } },
    DISCARD_B => sub { my $newtext = $to[pop()]; if ($newtext =~ m/(?:\p{IsPunct})|(?:\s)/) {$revised .= $newtext } else { $revised .= $RS.$newtext.$RE } },
    } );

    return ($original, $revised);

    } #END SUB comparator
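As far as I can tell, every character of the input is matched by one of the alternatives in that split pattern, so the text between matches is always empty and the resulting lists alternate empty strings with the real tokens. The subroutine above works with those empty strings in place. The sketch below only illustrates what the pattern produces and shows one optional way to drop the empties with `grep`. The sample string and the `tokenize` helper name are hypothetical.

```perl
use strict;
use warnings;

# The split pattern from the comparator subroutine, compiled once.
my $TOKEN = qr/((?:<[^>]+>)+|(?:\s)+|(?:\w[A-Za-z'-]*\w*)+|(?:\W|\P{IsWord})|(?:\p{IsDigit}))/;

# Hypothetical helper: split a string into tokens and drop empty fields.
sub tokenize {
    my ($text) = @_;
    return grep { length } split $TOKEN, $text;
}

my $text   = q{Hi, <b>you</b>!};
my @raw    = split $TOKEN, $text;   # still contains empty strings between tokens
my @tokens = tokenize($text);       # tags, words, whitespace and punctuation only

print join('|', @tokens), "\n";
```

Whether to filter is a design choice: the empty fields match each other in both sequences, so they should not change which words are flagged, but they do make the token lists roughly twice as long.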

    I have never found the output of a standard diff to be very enlightening. I'm sure it works well to change files, patch-style, but it isn't very readable for someone simply wanting to see what happened to the text in a side-by-side format. This procedure is making a visual inspection much easier, with the help of some HTML markup.

    Thank you!

    Blessings,

    ~Polyglot~

      I have never found the output of a standard diff to be very enlightening. I'm sure it works well to change files, patch-style, but it isn't very readable for someone simply wanting to see what happened to the text in a side-by-side format.

      Plain old diff (in the GNU version) has at least four output formats:

      • Normal (the default; ed-like change commands):
        /tmp>diff foo bar
        1,2c1,2
        < Bla bla. Foo bar baz.
        < Nada nada nada. Nada?
        ---
        > Bla bar. Foo bar baz.
        > Nada na-da nada. Nada?
        4c4
        < bar. Bla. Bar bla.
        ---
        > bar. Bla bar bla.
      • Unified:
        /tmp>diff -u foo bar
        --- foo 2021-01-31 15:13:16.892239748 +0100
        +++ bar 2021-01-31 15:13:43.403869518 +0100
        @@ -1,6 +1,6 @@
        -Bla bla. Foo bar baz.
        -Nada nada nada. Nada?
        +Bla bar. Foo bar baz.
        +Nada na-da nada. Nada?
         Foo foo foo! Bar. Foo
        -bar. Bla. Bar bla.
        +bar. Bla bar bla.
         Foo bla bla nada bar.
      • Side by side (also available via sdiff):
        /tmp>diff -y foo bar
        Bla bla. Foo bar baz.        | Bla bar. Foo bar baz.
        Nada nada nada. Nada?        | Nada na-da nada. Nada?
        Foo foo foo! Bar. Foo          Foo foo foo! Bar. Foo
        bar. Bla. Bar bla.           | bar. Bla bar bla.
        Foo bla bla nada bar.          Foo bla bla nada bar.
      • RCS:
        /tmp>diff -n foo bar
        d1 2
        a2 2
        Bla bar. Foo bar baz.
        Nada na-da nada. Nada?
        d4 1
        a4 1
        bar. Bla bar bla.

      TortoiseSVN comes with a diff and merge tool called TortoiseMerge that can show changes side by side, highlighting not only changed lines, but also changes within the lines.
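For readers without Algorithm::Diff or a graphical tool at hand, within-line highlighting of this kind can be approximated with a small longest-common-subsequence pass over words. This is a sketch of that general technique in core Perl, not the code TortoiseMerge or Algorithm::Diff actually uses. The `word_diff` name and the `[-old-]` / `{+new+}` markers (borrowed from the wdiff convention) are choices made here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return one string showing word-level changes
# between $old_text and $new_text, using a plain dynamic-programming LCS table.
sub word_diff {
    my ($old_text, $new_text) = @_;
    my @old = split ' ', $old_text;
    my @new = split ' ', $new_text;

    # $lcs[$i][$j] = length of the LCS of @old[$i..$#old] and @new[$j..$#new]
    my @lcs = map { [ (0) x (scalar(@new) + 1) ] } 0 .. scalar(@old);
    for my $i (reverse 0 .. $#old) {
        for my $j (reverse 0 .. $#new) {
            if ($old[$i] eq $new[$j]) {
                $lcs[$i][$j] = $lcs[$i + 1][$j + 1] + 1;
            }
            else {
                my ($down, $right) = ($lcs[$i + 1][$j], $lcs[$i][$j + 1]);
                $lcs[$i][$j] = $down >= $right ? $down : $right;
            }
        }
    }

    # Walk the table from the start, emitting kept, deleted and added words.
    my ($i, $j) = (0, 0);
    my @out;
    while ($i < @old && $j < @new) {
        if ($old[$i] eq $new[$j]) {
            push @out, $old[$i];
            $i++;
            $j++;
        }
        elsif ($lcs[$i + 1][$j] >= $lcs[$i][$j + 1]) {
            push @out, '[-' . $old[$i] . '-]';
            $i++;
        }
        else {
            push @out, '{+' . $new[$j] . '+}';
            $j++;
        }
    }

    # Whatever is left on either side is a pure deletion or insertion.
    push @out, map { '[-' . $_ . '-]' } @old[$i .. $#old];
    push @out, map { '{+' . $_ . '+}' } @new[$j .. $#new];
    return join ' ', @out;
}

# One of the changed lines from the foo/bar example above.
print word_diff('bar. Bla. Bar bla.', 'bar. Bla bar bla.'), "\n";
```

The table is quadratic in the number of words, which is fine for single lines or paragraphs; for whole files, Algorithm::Diff is the more sensible choice.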


      Side note:

      sub comparator {
      my $str1 = shift @_;
      #...
      my $RE = qq|</span> |;
      traverse_sequences( \@from, \@to,
      {
      # ...
      } );
      return ($original, $revised);
      } #END SUB comparator

      Proper indenting would make the "#END SUB comparator" redundant:

      sub comparator {
          my $str1 = shift @_;
          #...
          my $RE = qq|</span> |;
          traverse_sequences( \@from, \@to,
              {
                  # ...
              } );
          return ($original, $revised);
      }

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Regarding the proper indenting making my comment "redundant":

        I use indenting as well, but I tend to have subroutines that extend well beyond one screen's worth of code. I like having that note at the bottom just to help me locate my position within the file as I'm scanning. I've developed a habit of doing it this way, and it's not going to change!

        Remember, TMTOWTDI. This is my way.

        Blessings,

        ~Polyglot~
