web_diff.pl

updated - fixed missing HTML::Strip object
updated - now does a tr on whitespace - thx shmem
It's a simple script to diff the text on web pages.

Someone used the phrase somewhere on the perlmonks site which made me think it'd be a handy thing to have - thanks to them.

#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use Text::Diff;
use HTML::Strip;
require 5.008_000;

my $STORE="/home/charlie/diffs";
my $hs = HTML::Strip->new();

die ("Usage: $0 <URL_TO_DIFF>") unless ($#ARGV==0);

my $url=$ARGV[0];

# 'nice' URL
my $n_url=$url;
$n_url=~s/^http:\/\///;
$n_url=~s/\//_/g;

my $store_as = (-e "$STORE/$n_url" )
        ? "$STORE/$n_url.new"
        : "$STORE/$n_url";

if (is_success(getstore($url,$store_as)))  {
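        # Only diff when an earlier copy existed, i.e. the page was stored as ".new"
        unless ($store_as eq "$STORE/$n_url") {
                # $from is the freshly fetched page, $to the previously stored copy
                open (IN, $store_as); my @from=<IN>; close IN;
                open (IN,"$STORE/$n_url"); my @to=<IN>; close IN;
                my $from = $hs->parse(join ' ', @from);
                # tr takes a plain character list, not a [class]; /s squeezes runs
                $from=~tr/ \t/ /s;
                my $to = $hs->parse(join ' ',@to);
                $to=~tr/ \t/ /s;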
                my $diff = diff \$from, \$to;
                print $diff;
                rename $store_as, "$STORE/$n_url";
        }
}
else {  warn "Storing $store_as failed. Life sucks."  }

__END__

=head1 NAME

web_diff.pl  

=head2 VERSION

0.1

=head1 SYNOPSIS

diff text from a page retrieved off interweb and page stored locally

=head1 DESCRIPTION

Retrieve and store a page locally
If we have a previously stored local copy,
        Compare retrieved and local page
        If they are not identical
                Strip html from them
                Print a diff

=head2 OPTIONS

=over

=item  C<URL TO DIFF>

This isn't sanitized properly; this code is not for use by people
you don't trust implicitly :-)

=back

=head1 REQUIREMENTS

=over

=item Perl >= 5.8.0 (not tested on earlier versions)

=item HTML::Strip

=item Text::Diff

=item LWP::Simple

=back

=head1 COPYRIGHT AND LICENCE

               Copyright (C)2006  Charlie Harvey

 This program is free software; you can redistribute it and/or
 modify it under the terms of the GNU General Public License
 as published by the Free Software Foundation; either version
 2 of the License, or (at your option) any later version.

 This program is distributed in the hope that it will be
 useful, but WITHOUT ANY WARRANTY; without even the implied
 warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
 PURPOSE.  See the GNU General Public License for more
 details.

 You should have received a copy of the GNU General Public
 License along with this program; if not, write to the Free
 Software Foundation, Inc., 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.
 Also available on line: http://www.gnu.org/copyleft/gpl.html

=head1 SEE ALSO

=cut

--
charlieharvey.org.uk

Re: web_diff.pl by shmem (Chancellor) on Jun 29, 2006 at 23:04 UTC
The term "web diff" appeared here. You don't seem to handle white-space and line-breaks, so you could get false positives: files that differ although they look the same when viewed; and as far as HTML goes, they are identical ;-) --shmem _($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re^2: web_diff.pl by ciderpunx (Vicar) on Jun 30, 2006 at 11:43 UTC
Hey shmem thx for the original idea! I actually posted the wrong/first version of the code originally (doh!) - sorry about that (updated now). The idea was to parse out the HTML and diff only the text in the pages and I'd forgotten to strip the HTML. Stripping the whitespace seems like a good idea anyhow, so I shall add that in a second. I'm not so sure about linebreaks, but it'd be trivial to add if you found it useful. Cheers, Charlie -- charlieharvey.org.uk
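For readers following the whitespace discussion, here is a minimal sketch of squeezing whitespace with `tr`. The helper name `squeeze_ws` is hypothetical and not part of the posted script, and including `\r` and `\n` in the list is an assumption about how the optional linebreak handling could look. Note that `tr` takes a plain list of characters rather than a regex character class, so writing `[ \t]` would also turn literal brackets into spaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): collapse every run of
# spaces, tabs, and line breaks into a single space.
sub squeeze_ws {
    my ($text) = @_;
    # The search list is four literal characters; the single-character
    # replacement list is padded with its last character, so all four map
    # to a space. The /s modifier then squeezes each run down to one.
    $text =~ tr/ \t\r\n/ /s;
    return $text;
}

print squeeze_ws("Hello,   world\t\tagain\n\nbye"), "\n";
```

Dropping `\r\n` from the list gives the spaces-and-tabs-only behaviour mentioned in the update note at the top of the post.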
Re^3: web_diff.pl by shmem (Chancellor) on Jun 30, 2006 at 12:15 UTC
Linebreaks and whitespace are not structural elements in HTML and thus cannot be used to divide text into reasonably small yet big enough chunks to get a meaningful diff from two versions of a document. Hence the idea to use punctuation as the structural element inherent to the text itself, to break it up into units that can be compared. My approach would be to `s/[\s\n]+/ /gs` and `s/([\.,:;\!\?])\s/$1\n/gs` and diff the resulting lines. cheers, --shmem
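The punctuation-based splitting described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `punctuation_lines` is a hypothetical name, the two substitutions are simplified forms of the ones quoted in the reply, and the final diff step is skipped if Text::Diff is not installed.

```perl
use strict;
use warnings;

# Hypothetical helper: turn already-stripped page text into one
# punctuation-delimited unit per line, so a line-based diff has
# something meaningful to compare.
sub punctuation_lines {
    my ($text) = @_;
    $text =~ s/\s+/ /g;                    # collapse all whitespace, newlines included
    $text =~ s/([.,:;!?])\s/$1\n/g;        # break after punctuation followed by whitespace
    $text .= "\n" unless $text =~ /\n\z/;  # make sure the last unit ends its line
    return $text;
}

my $old = punctuation_lines("Hello world.  This is\nthe old text, mostly.");
my $new = punctuation_lines("Hello world. This is the new text, mostly.");

# Text::Diff is the module the original script already uses; load it lazily
# so this sketch still runs where it is missing.
my $diff = eval { require Text::Diff; Text::Diff::diff(\$old, \$new) };
print defined $diff ? $diff : "Text::Diff not available; units are:\n$old";
```

Because only the middle unit differs between the two sample strings, the diff should be confined to that one line instead of spanning the whole paragraph.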
Re: web_diff.pl by Ieronim (Friar) on Jun 30, 2006 at 08:19 UTC
Some time ago I was searching CPAN for something like this, and found a module called HTML::Diff. It does slightly different work than your script does - it analyses the HTML tags on the page too. I think it will be useful for you to study it :) HTH :)
Re^2: web_diff.pl by ciderpunx (Vicar) on Jun 30, 2006 at 11:49 UTC
Thx Ieronim - I shall have a look at that. -- charlieharvey.org.uk
Re: web_diff.pl by derby (Abbot) on Jun 30, 2006 at 12:35 UTC
Missing $hs creation? `$ perl -c web_diff.pl Global symbol "$hs" requires explicit package name at web_diff.pl line 28. Global symbol "$hs" requires explicit package name at web_diff.pl line 30. web_diff.pl had compilation errors.` -derby
Re^2: web_diff.pl by ciderpunx (Vicar) on Jun 30, 2006 at 12:54 UTC
oops - thx for spotting that - updated now -- charlieharvey.org.uk

