cmp two HTML fragments

by GrandFather (Saint)
on Feb 09, 2008 at 20:54 UTC

I had a need to compare two fragments of HTML to see if they were equivalent.

This snippet builds two HTML::TreeBuilder representations of the fragments, then recursively compares the contents of the fragments.

To use the snippet, call cmpHtml, passing the two fragments as strings:

print cmpHtml( '<p><font foo="bar" bar="1">bar 1</font></p>', '<p><font bar="2" foo="bar">bar 1</font></p>' );

Or, if you already have two HTML::Element objects that you want to compare, call cmpHtmlElt directly:

print cmpHtmlElt ($elt1, $elt2);

use strict;
use warnings;
use HTML::TreeBuilder;

sub cmpHtml {
    my ($html1, $html2) = @_;
    my $root1 = HTML::TreeBuilder->new;
    my $root2 = HTML::TreeBuilder->new;

    $root1->parse_content ($html1);
    $root1->elementify ();
    $root2->parse_content ($html2);
    $root2->elementify ();
    return cmpHtmlElt ($root1, $root2);
}

sub cmpHtmlElt {
    my ($elt1, $elt2) = @_;

    # A defined node never matches an undefined one
    my $cmp = defined $elt1 cmp defined $elt2;
    return $cmp if $cmp;
    return 0 unless defined $elt1;

    # Text nodes are plain strings, elements are references
    $cmp = ref $elt1 cmp ref $elt2;
    return $cmp if $cmp;
    return $elt1 cmp $elt2 unless ref $elt1;

    $cmp = $elt1->tag () cmp $elt2->tag ();
    return $cmp if $cmp;

    # Compare attributes regardless of their order in the source
    my %attribs1 = $elt1->all_attr ();
    my %attribs2 = $elt2->all_attr ();

    $cmp = keys %attribs1 <=> keys %attribs2;
    return $cmp if $cmp;

    for my $key (keys %attribs1) {
        return 1 unless exists $attribs2{$key};
        next if $key =~ /^_/;    # skip HTML::Element internal attributes
        $cmp = $attribs1{$key} cmp $attribs2{$key};
        return $cmp if $cmp;
    }

    # Compare children pairwise, in document order
    my @children1 = $elt1->content_list ();
    my @children2 = $elt2->content_list ();

    $cmp = @children1 <=> @children2;
    return $cmp if $cmp;

    for my $index (0 .. $#children1) {
        $cmp = cmpHtmlElt ($children1[$index], $children2[$index]);
        return $cmp if $cmp;
    }

    # Explicit result: the value of a sub that ends in a loop is unspecified
    return 0;
}

Replies are listed 'Best First'.
Re: cmp two HTML fragments
by lodin (Hermit) on Feb 10, 2008 at 14:51 UTC

    Nice. Have you considered turning this into a module?

    Another way to do this is to use HTML::PrettyPrinter or something similar and do a string-wise comparison. That way it's easier to find how the code differs (using string diff tools) if needed, but it's probably a lot slower.

    There's an (inherited) bug in your code. It leaks memory. You need to free the circular references in the tree by using the delete method:

    sub cmpHtml {
        ...
        my $cmp = cmpHtmlElt ($root1, $root2);
        $_->delete for $root1, $root2;
        return $cmp;
    }

    As an aside, I'd like to share this little trick:

    $cmp = EXPR; return $cmp if $cmp;

    which you make plenty of use of, can be replaced with

    { return EXPR || next }

    (assuming scalar context), though that may be a bit too obfuscated to use in public code. :-)
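The trick works because a bare block counts as a loop that runs once, so `next` leaves the block when EXPR is false and the `return` never fires. Below is a minimal sketch comparing the two styles. The subs `cmpPairVerbose` and `cmpPairTerse` and the [name, number] data layout are hypothetical, invented purely for illustration. Note that `||` binds tighter than `cmp` and `<=>`, so an EXPR built from those operators needs its own parentheses.

```perl
use strict;
use warnings;

# Hypothetical comparator: order two [name, number] pairs by name, then by
# number, using the "assign, then return if true" style from the post.
sub cmpPairVerbose {
    my ($x, $y) = @_;
    my $cmp = $x->[0] cmp $y->[0];
    return $cmp if $cmp;
    $cmp = $x->[1] <=> $y->[1];
    return $cmp if $cmp;
    return 0;
}

# The same comparator using the bare-block trick. When the comparison
# yields 0, 'next' exits the enclosing bare block before 'return' runs,
# and execution continues with the following statement.
sub cmpPairTerse {
    my ($x, $y) = @_;
    { return( ($x->[0] cmp $y->[0]) || next ) }
    { return( ($x->[1] <=> $y->[1]) || next ) }
    return 0;
}
```

Both subs should give the same result for any pair of inputs; the terse form merely saves the temporary variable at some cost in readability.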

    lodin

      As it happens, the code shown was pretty transient anyway. For the module test suite that I wrote the code for, I replaced it with:

      my $root1 = HTML::TreeBuilder->new ();
      my $root2 = HTML::TreeBuilder->new ();

      $root1->parse_content ($rendered)->elementify ()
          ->delete_ignorable_whitespace ();
      $root2->parse_content ($expected)->elementify ()
          ->delete_ignorable_whitespace ();
      is ($root1->as_HTML (undef, ' ', {}), $root2->as_HTML (undef, ' ', {}), $testName);

      in any case so that I'd get better diagnostics (I see the two HTML fragments when the test fails). However, with a little tweaking to give a traceback the original code would be even better in the test context because it would highlight the difference by reducing the clutter. That version might almost be worth generating a module for.
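One way the "traceback" tweak could look is sketched below. This is not the version from the thread: the names `firstHtmlDifference` and `htmlDifference`, the path format, and the message wording are all assumptions made for illustration. Instead of a cmp-style number, the walker returns a string saying where the first difference was found, or undef when the fragments match. It relies only on the HTML::Element methods `tag`, `all_external_attr` and `content_list`; the wrapper also frees both trees with `delete`, as suggested earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: describe the first difference between two nodes, or
# return undef if they are equivalent. $path tracks the position in the tree.
sub firstHtmlDifference {
    my ($elt1, $elt2, $path) = @_;
    $path = '' unless defined $path;

    # Text nodes are plain strings in an HTML::Element tree
    if (!ref $elt1 || !ref $elt2) {
        return "$path: element vs text" if ref $elt1 || ref $elt2;
        return $elt1 eq $elt2 ? undef : "$path: text '$elt1' vs '$elt2'";
    }

    my ($tag1, $tag2) = ($elt1->tag (), $elt2->tag ());
    return "$path: <$tag1> vs <$tag2>" if $tag1 ne $tag2;
    $path .= "/$tag1";

    # Check the union of attribute names so a missing one is reported too
    my %attribs1 = $elt1->all_external_attr ();
    my %attribs2 = $elt2->all_external_attr ();
    for my $key (sort keys %{{%attribs1, %attribs2}}) {
        my $v1 = exists $attribs1{$key} ? $attribs1{$key} : '(missing)';
        my $v2 = exists $attribs2{$key} ? $attribs2{$key} : '(missing)';
        return "$path: attribute '$key' is '$v1' vs '$v2'" if $v1 ne $v2;
    }

    my @children1 = $elt1->content_list ();
    my @children2 = $elt2->content_list ();
    return "$path: " . scalar (@children1) . ' vs ' . scalar (@children2)
        . ' children' if @children1 != @children2;

    for my $index (0 .. $#children1) {
        my $diff = firstHtmlDifference ($children1[$index],
            $children2[$index], "$path\[$index]");
        return $diff if defined $diff;
    }
    return;
}

# Hypothetical wrapper: parse two fragments, report, then free both trees.
sub htmlDifference {
    my ($html1, $html2) = @_;
    require HTML::TreeBuilder;    # loaded only when the wrapper is called
    my @roots = map {
        my $root = HTML::TreeBuilder->new;
        $root->parse_content ($_);
        $root->elementify ();
        $root;
    } $html1, $html2;
    my $diff = firstHtmlDifference (@roots);
    $_->delete for @roots;
    return $diff;
}
```

In a test script this might be used as `my $diff = htmlDifference ($rendered, $expected); ok (!defined $diff, $testName) or diag ($diff);`, so a failure shows only the point of divergence rather than both complete fragments.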


      Perl is environmentally friendly - it saves trees
Re: cmp two HTML fragments
by planetscape (Chancellor) on Mar 22, 2008 at 21:09 UTC
