PerlMonks  

Where did HTML::Sanitizer go?

by wazoox (Prior)
on Nov 16, 2009 at 18:32 UTC ( [id://807531]=perlquestion )

wazoox has asked for the wisdom of the Perl Monks concerning the following question:

I've just read this Coding Horror post, which pointed to this nice post about How not to parse HTML with regexes.

That post talks about a wonderfully useful CPAN module, HTML::Sanitizer. Unfortunately, it's nowhere to be found nowadays except in the CPAN archive, which is too bad. Do you know why it's hidden, or what I could use instead that would be just as easy and nice?

Replies are listed 'Best First'.
Re: Where did HTML::Sanitizer go?
by Old_Gray_Bear (Bishop) on Nov 16, 2009 at 19:02 UTC
    Take a look at HTML::Scrubber. It appears to be a replacement/enhanced tool. (Written by the Monastery's own PodMaster.)
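For anyone wanting to try the suggestion above, here is a minimal sketch of whitelist-style cleaning with HTML::Scrubber, adapted from the module's synopsis. The input string and the choice of allowed tags are illustrative assumptions, not a recommended policy; check the module's documentation before relying on it for untrusted input.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Assumption: only these few formatting tags are acceptable in user input.
my $scrubber = HTML::Scrubber->new( allow => [ qw( p b i u hr br ) ] );

# Hypothetical untrusted input, for illustration only.
my $dirty = '<p><b>bold</b> <em>missing</em> <script>alert(1)</script></p>';

# Tags that are not in the allow list are stripped from the result.
my $clean = $scrubber->scrub($dirty);

print $clean, "\n";
```

The allow list is deliberately short; the idea is to start from "deny everything" and open up only the tags you actually need.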

    Update: It looks like HTML::Sanitizer fell off of CPAN at least a year back. I have found 'module not found' errors for HTML::Sanitizer in the Perl Tester's Reports as far back as Perl 5.8. Also, there are several sites that purport to have HTML-Sanitizer-0.04 for download, but when you follow the trail, they all lead back to CPAN, sigh.

    ----
    I Go Back to Sleep, Now.

    OGB

      That's great, it has all the features, plus the added bonus of being, hum... available :)
Re: Where did HTML::Sanitizer go?
by redgreen (Priest) on Nov 17, 2009 at 01:53 UTC

    You also might want to consider http://tidy.sourceforge.net/ for cleaning up your HTML.

    While not a Perl solution, it does bring some sanity to HTML tag soup.

      Well, that's not the same use case. I use HTML Tidy all the time for static HTML code, but HTML::Sanitizer looked like a really great solution to "purify" input.
      Another option for tidying the HTML before sanitising it is XML::LibXML. Its HTML parsing methods (parse_html_string, parse_html_file) cope gracefully with mismatched tag nesting, broken quoting and other common offences. You can then use the document's toStringHTML method to produce nice clean HTML.
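The XML::LibXML approach might look roughly like the sketch below. The sample markup is made up for illustration, and silent recovery mode is an assumption about how you would want parse errors handled; note that this only repairs the markup, it does not remove dangerous tags or attributes.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical tag soup: <b> and <i> are mis-nested and <p> is never closed.
my $soup = '<p>Some <b>bold <i>text</b> here</i>';

my $parser = XML::LibXML->new;
# Assumption: we want libxml2 to repair what it can without printing warnings.
$parser->recover_silently(1);

# Parse the broken markup into a DOM tree, then serialise it back as HTML.
my $doc  = $parser->parse_html_string($soup);
my $html = $doc->toStringHTML;

print $html;
```

The serialised result is a full document (libxml2 wraps the fragment in html and body elements), so you may need to extract the body contents before passing them on to a sanitiser.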
