sourcecode
OeufMayo
<code>#!/usr/bin/perl -w
use strict;
use HTML::Parser ();
# See PYX format description
# http://www.xml.com/pub/a/2000/03/15/feature/index.html
my $parser = HTML::Parser->new(
    xml_mode        => 1,
    unbroken_text   => 1,
    ignore_elements => ['style', 'script'], # CDATA isn't supported
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            print "($tag\n";
            # PYX attribute lines have the form "Aname value"
            print "A$_ $attr->{$_}\n" foreach sort keys %{$attr};
        }, "tagname, attr"],
    end_h => [
        sub {
            print ")" . shift() . "\n";
        }, "tagname"],
    text_h => [
        sub {
            my $text = shift;
            $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            return unless length $text; # skip whitespace-only text
            $text =~ s/\n/\\n/g;        # PYX escapes newlines as \n
            print "-$text\n";
        }, "dtext"],
);
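die "usage: $0 file1.html > file1.pyx\n" unless @ARGV;
foreach my $file (@ARGV) {
    # parse_file() returns undef if the file cannot be opened
    $parser->parse_file($file) or warn "Can't open $file: $!\n";
    $parser->eof();
}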
</code>
<p>PYX, the notation used by Pyxie, is an alternative way of
representing XML data. It is line-oriented: each line carries one
piece of information, and its first character tells you what kind
it is (start tag, end tag, attribute or text).<br />
The nice thing about PYX is how easy its output is to parse. On
the other hand, many features of the XML format cannot be
represented in PYX (CDATA sections, entities, ...).
</p>
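<p>To give an idea of how little is needed to read PYX back, here is
a minimal sketch of a reader. It only assumes the line-oriented layout
described above; <tt>parse_pyx_line</tt> and the event names are
hypothetical, chosen for illustration, and the payload of an attribute
line is returned as-is rather than split into name and value.</p>
<code>
#!/usr/bin/perl
use strict;
use warnings;

# Map the first character of a PYX line to an event name.
# The event names are illustrative, not part of any module.
my %event_for = (
    '(' => 'start',
    ')' => 'end',
    'A' => 'attr',
    '-' => 'text',
    '?' => 'pi',
);

# Split one PYX line into (event, payload).
# Returns an empty list for blank or unrecognized lines.
sub parse_pyx_line {
    my ($line) = @_;
    chomp $line;
    return unless length $line;
    my $type = $event_for{ substr($line, 0, 1) } or return;
    return ($type, substr($line, 1));
}

# Read PYX on STDIN and print one "event: payload" line per input line.
while (my $line = <STDIN>) {
    my ($type, $data) = parse_pyx_line($line) or next;
    print "$type: $data\n";
}
</code>
<p>Pipe the output of the script above into this one to see the
stream of events.</p>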
<p>Now, I know the module [cpan://XML::PYX] exists, and it
even comes with a script called pyxhtml, which does pretty
much what this code does.<br />But XML::PYX <i>per se</i>
isn't very flexible if you want finer control over what is
kept from the HTML file.</p>
<p>Hopefully, this code can be easily customized to suit your
needs, provided you know how to use HTML::Parser (which is
really fun to use, especially version 3).</p>
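<p>As an example of that kind of customization, here is a rough sketch
that keeps only a chosen set of elements. It relies on the
<tt>report_tags</tt> method of HTML::Parser, which limits start and end
events to the listed tags; text is still reported everywhere.
<tt>filter_to_pyx</tt> is a hypothetical helper, attributes are left out
to keep it short, and only newlines are escaped in text.</p>
<code>
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the PYX lines for an HTML string,
# with start/end events only for the tags listed in @tags.
sub filter_to_pyx {
    my ($html, @tags) = @_;
    my @out;
    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,
        start_h => [ sub { push @out, "($_[0]" }, "tagname" ],
        end_h   => [ sub { push @out, ")$_[0]" }, "tagname" ],
        text_h  => [
            sub {
                my $text = shift;
                $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
                return unless length $text; # skip whitespace-only text
                $text =~ s/\n/\\n/g;        # keep one event per line
                push @out, "-$text";
            }, "dtext" ],
    );
    $parser->report_tags(@tags);  # all other tags produce no events
    $parser->parse($html);
    $parser->eof;
    return @out;
}

print "$_\n" for filter_to_pyx(
    '<div><h1>Title</h1><p>Body <b>text</b></p></div>', qw(h1 p));
</code>
<p>Swapping <tt>report_tags</tt> for <tt>ignore_tags</tt> gives the
opposite behaviour: everything is reported except the listed tags.</p>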
<p>And the really cool thing is that your HTML doesn't have
to be a valid XML file! (I wouldn't try to feed it Word 2000
pseudo-HTML though...)</p>
<p>[http://www.xml.com/pub/a/2000/03/15/feature/index.html|More info on PYX]</p>
HTML Utility
Briac Pilpré - briac@cpan.org <!-- Hubris! he he :) -->