web_diff.pl

by ciderpunx (Vicar)
on Jun 29, 2006 at 16:22 UTC ( [id://558367]=CUFP )

updated - fixed missing HTML::Strip object
updated - now does a tr on whitespace - thx shmem
It's a simple script to diff the text on web pages.

Someone used the phrase "web diff" somewhere on the PerlMonks site, which made me think it'd be a handy thing to have. Thanks to them.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use Text::Diff;
use HTML::Strip;
require 5.008_000;

my $STORE = "/home/charlie/diffs";
my $hs    = HTML::Strip->new();

die "Usage: $0 <URL_TO_DIFF>\n" unless @ARGV == 1;
my $url = $ARGV[0];

# 'nice' URL: drop the scheme and turn slashes into underscores
my $n_url = $url;
$n_url =~ s/^http:\/\///;
$n_url =~ s/\//_/g;

# The first fetch becomes the stored copy; later fetches go to a .new file
my $store_as = (-e "$STORE/$n_url") ? "$STORE/$n_url.new" : "$STORE/$n_url";

if (is_success(getstore($url, $store_as))) {
    # Only diff when a stored copy already existed
    unless ($store_as eq "$STORE/$n_url") {
        open(my $in, '<', "$STORE/$n_url") or die "Cannot open $STORE/$n_url: $!\n";
        my @from = <$in>;    # previously stored page
        close $in;
        open($in, '<', $store_as) or die "Cannot open $store_as: $!\n";
        my @to = <$in>;      # freshly retrieved page
        close $in;

        my $from = $hs->parse(join ' ', @from);
        $hs->eof;            # reset the parser between documents
        $from =~ tr/ \t/ /s; # squash runs of spaces and tabs into one space
        my $to = $hs->parse(join ' ', @to);
        $hs->eof;
        $to =~ tr/ \t/ /s;

        my $diff = diff \$from, \$to;
        print $diff;
        rename $store_as, "$STORE/$n_url";
    }
}
else {
    warn "Storing $store_as failed. Life sucks.";
}
__END__

=head1 NAME

web_diff.pl

=head2 VERSION

0.1

=head1 SYNOPSIS

diff text from a page retrieved off interweb and page stored locally

=head1 DESCRIPTION

  Retrieve and store a page locally
  If we have a previously stored local copy,
    Compare retrieved and local page
    If they are not identical
      Strip html from them
      Print a diff

=head2 OPTIONS

=over

=item C<URL TO DIFF>

This isn't sanitized properly; this code is not for use by people you don't trust implicitly :-)

=back

=head1 REQUIREMENTS

=over

=item Perl >= 5.8.0 (not tested on earlier versions)

=item HTML::Strip

=item Text::Diff

=item LWP::Simple

=back

=head1 COPYRIGHT AND LICENCE

Copyright (C)2006 Charlie Harvey

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Also available on line: http://www.gnu.org/copyleft/gpl.html

=head1 SEE ALSO

=cut
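
Since the POD admits that the URL isn't sanitized before it becomes a file name, here is a minimal sketch of one way that step might be tightened. The helper name url_to_filename is hypothetical and not part of the script; the whitelist of characters is an assumption, not a vetted security measure.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): map a URL to a file name
# that contains only letters, digits, dots, dashes and underscores.
sub url_to_filename {
    my ($url) = @_;
    (my $name = $url) =~ s{^https?://}{}i;   # drop the scheme
    $name =~ s/[^A-Za-z0-9._-]/_/g;          # replace everything else, slashes included
    $name =~ s/^\.+/_/;                      # never start with a dot
    return $name;
}

# prints: www.perlmonks.org__node_id_558367
print url_to_filename("http://www.perlmonks.org/?node_id=558367"), "\n";
```

Two different URLs can still collapse to the same name under this scheme, so it trades uniqueness for safety.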

Replies are listed 'Best First'.
Re: web_diff.pl
by shmem (Chancellor) on Jun 29, 2006 at 23:04 UTC
    The term "web diff" appeared here.

    You don't seem to handle whitespace and line breaks, so you could get false positives: files that differ on disk although they look the same when viewed and, as far as HTML goes, are identical ;-)
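
A minimal sketch of the kind of whitespace handling meant here, assuming it is enough to squeeze every run of spaces, tabs and line breaks into a single space before comparing. The helper name squash_ws is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: squeeze every run of spaces, tabs and line breaks
# into a single space. tr/// takes plain character lists, not regex
# character classes, so no square brackets appear in the search list.
sub squash_ws {
    my ($text) = @_;
    $text =~ tr/ \t\r\n/ /s;
    return $text;
}

my $left  = "Perl  Monks\n\tweb   diff";
my $right = "Perl Monks web diff";

# prints: same text
print squash_ws($left) eq squash_ws($right) ? "same text\n" : "different text\n";
```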

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      Hey shmem thx for the original idea!

      I actually posted the wrong/first version of the code originally (doh!) - sorry about that (updated now). The idea was to parse out the HTML and diff only the text in the pages and I'd forgotten to strip the HTML.

      Stripping the whitespace seems like a good idea anyhow, so I shall add that in a second. I'm not so sure about linebreaks, but it'd be trivial to add if you found it useful. Cheers, Charlie
        Line breaks and whitespace are not structural elements in HTML, so they cannot be used to divide text into chunks that are small enough, yet big enough, to get a meaningful diff from two versions of a document.

        Hence the idea to use punctuation, a structural element inherent to the text itself, to break it up into units that can be compared.

        My approach would be to s/[\s\n]+/ /gs and s/([\.,:;\!\?])\s/$1\n/gs and diff the resulting lines.
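
The two substitutions above can be sketched as a small helper. This is an illustration rather than shmem's actual code: the name sentence_lines and the sample strings are assumptions, and the regexes are lightly simplified equivalents of the ones quoted.

```perl
use strict;
use warnings;

# Hypothetical helper: turn stripped page text into one punctuation-delimited
# unit per line, so that a line-based diff compares sentences and clauses.
sub sentence_lines {
    my ($text) = @_;
    $text =~ s/\s+/ /g;                # collapse all whitespace, newlines included
    $text =~ s/^ | $//g;               # trim a leading or trailing space
    $text =~ s/([.,:;!?])\s/$1\n/g;    # break after punctuation followed by whitespace
    return "$text\n";
}

my $old = "Hello   world. This is\nthe old page, honest.";
my $new = "Hello world.\n\nThis is the new page, honest.";

print sentence_lines($old);
print "--\n";
print sentence_lines($new);
```

Both versions come out as three lines, and only the middle one differs. The two results can then be passed as string references to diff from Text::Diff, as web_diff.pl already does with its stripped text.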

        cheers,
        --shmem

Re: web_diff.pl
by Ieronim (Friar) on Jun 30, 2006 at 08:19 UTC
    Some time ago I was searching CPAN for something like this and found a module called HTML::Diff. It does a slightly different job than your script: it analyses the HTML tags on the page too. I think it would be useful for you to study it :)

    HTH :)

Re: web_diff.pl
by derby (Abbot) on Jun 30, 2006 at 12:35 UTC

    Missing $hs creation?

    $ perl -c web_diff.pl
    Global symbol "$hs" requires explicit package name at web_diff.pl line 28
    Global symbol "$hs" requires explicit package name at web_diff.pl line 30
    web_diff.pl had compilation errors.

    -derby
