PerlMonks  

Re: Dynamically cleaning up HTML fragments

by wfsp (Abbot)
on Sep 24, 2010 at 08:12 UTC


in reply to Dynamically cleaning up HTML fragments

I highly recommend having a look at Dave Raggett's HTML Tidy. I've found it to be a very nifty bit of kit for these types of jobs.

Careful tweaking of the config would, I believe, achieve many of the tasks you are looking at.

Re^2: Dynamically cleaning up HTML fragments
by SilasTheMonk (Chaplain) on Sep 24, 2010 at 11:41 UTC
    Actually HTML::Tidy seems to have a bit of bad history at Debian. My original claim that it is not in Debian was wrong, but it's definitely in an odd state. I am investigating.
      Ubuntu 8.04, perl 5.10.1

      HTML::Tidy has been released three times this year (the last on 17 September) so some of the criticisms may have been addressed.

      It requires tidyp (version 1.04 recently released) which is a fork of tidy.

      I was able to install tidyp in the usual way and H::T installed without fuss using cpanp.

      #! /usr/bin/perl
      use strict;
      use warnings;
      use HTML::Tidy;

      my $tidy = HTML::Tidy->new(
          {
              output_xhtml      => 1,
              tidy_mark         => 0,
              markup            => 1,
              q{show-body-only} => 1,
          }
      );

      printf qq{tidyp: %s\n}, $tidy->tidyp_version;
      printf qq{libtidyp: %s\n}, $tidy->libtidyp_version;
      printf qq{HTML::Tidy: %s\n}, $HTML::Tidy::VERSION;

      my $html = do {local $/;<DATA>};

      $tidy->parse(q{test.html}, $html)
        or die q{parse failed};

      for my $message ($tidy->messages){
        print $message->as_string, qq{\n};
      }

      my $xhtml = $tidy->clean($html);
      print $xhtml;

      __DATA__
      <div>
        <p>tidy</p>
        <img src="pic.jpg">
      </div>

      Output:

      tidyp: 1.04
      libtidyp: 1.04
      HTML::Tidy: 1.54
      test.html (1:1) Warning: missing <!DOCTYPE> declaration
      test.html (1:1) Warning: inserting implicit <body>
      test.html (1:1) Warning: inserting missing 'title' element
      test.html (3:3) Warning: <img> lacks "alt" attribute
      <div>
      <p>tidy</p>
      <img src="pic.jpg" /></div>

      See the tidy quick reference for all the configuration options.
        Thanks. It actually installs fine on Debian using the packaging system. And I was able to use and configure it. The issues are:
        1. The version in Debian is old.
        2. An update does not appear to be happening, I think because of the fork of tidy. The fork makes the packaging very messy, and until someone really screams it won't happen. I am in the relevant group and I won't volunteer.
        3. I could not configure it to change "<span>blah</span>" to "blah". Saying that tidy is not intended to do that is reasonable, but I want it to do that. JavaScript rich text editors generate stuff that one does not necessarily want or need.
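
For the third point, a rough sketch of one way to unwrap `<span>` tags as a separate pass before or after tidying. It is not a tidy or HTML::Tidy feature; `strip_spans` is a hypothetical helper name, and the regex assumes simple fragments where no attribute value contains a ">" character. For anything less predictable, a real HTML parser would be the safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Tidy): remove opening and
# closing <span> tags but keep whatever text or markup they wrapped.
# Assumption: no attribute value contains a literal ">".
sub strip_spans {
    my ($html) = @_;

    # \b stops the pattern from matching other tags that merely
    # start with "span"; /i handles upper-case tags, /g all of them.
    $html =~ s{</?span\b[^>]*>}{}gi;

    return $html;
}

print strip_spans(q{<p><span class="x">blah</span> and <b>bold</b></p>}), "\n";
```

Other inline markup, such as the `<b>` element in the sample, is left untouched because the pattern only names `span`.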
Re^2: Dynamically cleaning up HTML fragments
by petdance (Parson) on Sep 26, 2010 at 04:26 UTC
    tidyp is a fork of Dave's tidy, because the people who maintain tidy do not do releases. Without releases, it is all but impossible to build HTML::Tidy on top of it.

    xoxo,
    Andy
