
Re^2: web_diff.pl

by ciderpunx (Vicar)
on Jun 30, 2006 at 11:43 UTC


in reply to Re: web_diff.pl
in thread web_diff.pl

Hey shmem, thanks for the original idea!

I actually posted the wrong (first) version of the code originally (doh!). Sorry about that; it's updated now. The idea was to parse the HTML and diff only the text of the pages, and I'd forgotten to strip the HTML.

Stripping the whitespace seems like a good idea anyhow, so I shall add that in a second. I'm not so sure about linebreaks, but it'd be trivial to add if you found it useful. Cheers, Charlie

Replies are listed 'Best First'.
Re^3: web_diff.pl
by shmem (Chancellor) on Jun 30, 2006 at 12:15 UTC
    Linebreaks and whitespace are not structural elements in HTML, so they cannot be used to divide text into chunks that are reasonably small yet big enough to get a meaningful diff from two versions of a document.

    Hence the idea to use punctuation, the structural element inherent to the text itself, to break it up into units that can be compared.

    My approach would be to apply s/\s+/ /g (\s already matches \n, so this collapses linebreaks too) and then s/([.,:;!?])\s/$1\n/g, and diff the resulting lines.
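As a rough illustration of the two substitutions above, here is a minimal sketch that turns already-extracted page text into one line per punctuation-delimited unit. The helper name `punct_chunks` and the sample strings are made up for this example and are not from web_diff.pl; the sketch also assumes the HTML has already been stripped, and it adds a trim step that the original one-liners do not have.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of web_diff.pl): split plain text into
# punctuation-delimited units, one unit per returned list element.
sub punct_chunks {
    my ($text) = @_;
    $text =~ s/\s+/ /g;               # collapse whitespace runs, newlines included
    $text =~ s/^ | $//g;              # trim one leading/trailing space (added assumption)
    $text =~ s/([.,:;!?])\s/$1\n/g;   # break after punctuation followed by whitespace
    return split /\n/, $text;
}

# Two versions of the same text that differ in layout and in one word.
my @old = punct_chunks("Hello world.  This is\na test, really.");
my @new = punct_chunks("Hello world. This is a test, honestly.");

print "old: $_\n" for @old;
print "new: $_\n" for @new;
```

Both versions come out as three units, and only the last unit differs ("really." versus "honestly."), so a line-based diff of the two lists would flag just that unit even though the whitespace and linebreaks in the inputs are different.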

    cheers,
    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
