Category: | HTML Utility |
Author/Contact Info | Briac Pilpré - briac@cpan.org |
Description: | Pyxie (PYX) is an alternative, line-oriented way of representing
XML data: each piece of information (start tag, attribute, text, end tag)
goes on its own line. Now, I know the module XML::PYX exists, and it
even comes with a script called pyxhtml, which does pretty
much what this code does. Hopefully, this code is easy to customize to suit your needs, provided you know how to use HTML::Parser (which is really fun to use, especially version 3). And the really cool thing is that your HTML doesn't have to be a valid XML file! (I wouldn't try to feed it Word 2000 pseudo-HTML though...) |
#!/usr/bin/perl -w
use strict;
use HTML::Parser ();

# See the PYX format description:
# http://www.xml.com/pub/a/2000/03/15/feature/index.html

my $parser = HTML::Parser->new(
    xml_mode        => 1,
    unbroken_text   => 1,
    ignore_elements => ['style', 'script'],    # CDATA isn't supported

    # Start tag: "(" plus the tag name, then each attribute as an "A" line
    # (name) followed by a "-" line (value).
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            print "($tag\n";
            print "A$_\n-$attr->{$_}\n" foreach keys %{$attr};
        },
        "tagname, attr"
    ],

    # End tag: ")" plus the tag name.
    end_h => [ sub { print ")" . shift() . "\n"; }, "tagname" ],

    # Text: "-" plus the decoded text, with surrounding whitespace trimmed.
    text_h => [
        sub {
            my $text = shift;
            $text =~ s/^\s*|\s*$//g;
            print "-$text\n";
        },
        "dtext"
    ],
);

die "usage: $0 file1.html > file1.pyx\n" unless @ARGV;

foreach (@ARGV) {
    # parse_file returns undef if the file can't be opened.
    $parser->parse_file($_) or warn "$0: can't open $_: $!\n";
    $parser->eof();
}
 |
|
---|
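The point of a one-item-per-line format is that it is easy to process with ordinary line-based tools. As a rough sketch of that idea, the helper below collects the text found inside a given element from the output of the script above. The name `text_inside` and the sample lines are made up for illustration; the sample is hand-written in the shape the script prints, not captured from a real run. It assumes every "A" line is immediately followed by a "-" line holding the attribute value, and that no text or attribute value contains an embedded newline.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return the text
# chunks that appear inside <$element>, given the script's output lines.
sub text_inside {
    my ($element, @lines) = @_;
    my $depth      = 0;    # > 0 while we are inside $element
    my $skip_value = 0;    # true right after an "A" (attribute name) line
    my @found;

    for my $line (@lines) {
        chomp $line;
        # First character is the event type, the rest is its payload.
        my ($type, $rest) = $line =~ /^(.)(.*)$/ or next;

        if ($type eq 'A') {
            $skip_value = 1;    # the next "-" line is an attribute value
        }
        elsif ($type eq '-') {
            if ($skip_value) { $skip_value = 0; next; }
            push @found, $rest if $depth && length $rest;
        }
        elsif ($type eq '(') {
            $skip_value = 0;
            $depth++ if $rest eq $element;
        }
        elsif ($type eq ')') {
            $depth-- if $depth && $rest eq $element;
        }
    }
    return @found;
}

# Hand-written sample in the script's output shape (an assumption, see above).
my @pyx = ("(p", "Aclass", "-intro", "-Hello", "(b", "-world", ")b", ")p");
print join(" ", text_inside("p", @pyx)), "\n";
```

To run it on real output, the sample array could be replaced with lines read from a `.pyx` file produced by the script. Note that standard PYX, as described in the xml.com article, writes an attribute as a single `Aname value` line, so this helper is tied to the two-line attribute layout used here.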