Honestly, when I read the title of your node, HTML tidy immediately sprang to mind, as it even has command line switches meant specifically for cleaning up Office HTML. On that website, there is also code showing how to call HTML tidy from Perl, including some proposed error checking, which seems mostly geared towards Unix. On second thought, it is not really clear why they use the code they use, so I'll post it here, together with my replacement :
## This is what I think is needed beforehand :
open( TIDY, "html-tidy $commandline|") or die "Couldn't spawn html-tidy : $!\n";
my @output;
@output = <TIDY>;
## Here begins their code :
if (close(TIDY) == 0) {
    my $exitcode = $? >> 8;
    if ($exitcode == 1) {
        printf STDERR "tidy issued warning messages\n";
    } elsif ($exitcode == 2) {
        printf STDERR "tidy issued error messages\n";
    } else {
        die "tidy exited with code: $exitcode\n";
    }
} else {
    printf STDERR "tidy detected no errors\n";
}
I think this could simply be done with the following code,
but I haven't checked all possible outcomes...
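my @output = qx(html-tidy $commandline);
# $? is -1 if html-tidy could not be started at all
die "Couldn't run html-tidy : $!\n" if $? == -1;
my $exitcode = $? >> 8;
if ($exitcode == 0) {
    # without this branch a clean run would fall through to the die below
    print STDERR "tidy detected no errors\n";
} elsif ($exitcode == 1) {
    print STDERR "tidy issued warning messages\n";
} elsif ($exitcode == 2) {
    print STDERR "tidy issued error messages\n";
} else {
    die "tidy exited with code: $exitcode\n";
}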
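For what it's worth, here is a minimal sketch of how the whole of $? could be decoded instead of only the upper byte. The name describe_status is my own invention and not part of tidy or any module. The meaning of exit codes 1 and 2 is taken from the code above, and I am assuming that 0 means a clean run. It is untested against a real html-tidy, so treat it as an illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a wait status (the value of $? after qx//,
# system() or close() on a pipe) into a human readable message.
sub describe_status {
    my ($status) = @_;

    # -1 means the program could not be started at all
    return "html-tidy could not be started" if $status == -1;

    # the low seven bits hold the signal number, if a signal killed it
    my $signal = $status & 127;
    return "html-tidy was killed by signal $signal" if $signal;

    # the upper byte holds the real exit code
    my $exitcode = $status >> 8;
    return "tidy detected no errors"      if $exitcode == 0;   # assumption
    return "tidy issued warning messages" if $exitcode == 1;
    return "tidy issued error messages"   if $exitcode == 2;
    return "tidy exited with code: $exitcode";
}
```

After my @output = qx(html-tidy $commandline); you would then simply print STDERR describe_status($?), "\n"; and decide for yourself which of the messages should be fatal.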
Wrapping it up, unless you give us a really convincing reason why html-tidy is not an option (and by "not an option" I also mean that you cannot put html-tidy into a Perl script, write it out to /tmp, start it there and delete the file again afterwards), I'll stick with this solution :-)
perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The
$d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider
($c = $d->accept())->get_request(); $c->send_response( new #in the
HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web