PerlMonks  

html2pyx

by OeufMayo (Curate)
on Aug 31, 2001 at 03:13 UTC ( [id://109244]=sourcecode )
Category: HTML Utility
Author/Contact Info Briac Pilpré - briac@cpan.org
Description:

PYX (the notation used by Pyxie) is an alternative way of representing XML data. The data is represented very simply, with one piece of information per line.
The nice thing about PYX is how easy its output is to parse. On the other hand, many features of the XML format can't be represented in PYX (CDATA, entities, ...).
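
To illustrate how little code it takes to consume PYX, here is a minimal sketch in core Perl. It assumes the usual PYX line prefixes ("(" start tag, ")" end tag, "A" attribute written as "Aname value", "-" text, "?" processing instruction). The sub parse_pyx_line and the sample lines are hypothetical, made up for this illustration, and escaped newlines in text are left untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn one PYX line into a small event hash.
# The first character is the event type, the rest is its payload.
sub parse_pyx_line {
    my ($line) = @_;
    chomp $line;
    my ($type, $rest) = $line =~ /^(.)(.*)$/s
        or return;                              # skip empty lines

    if ($type eq '(') { return { type => 'start', name => $rest } }
    if ($type eq ')') { return { type => 'end',   name => $rest } }
    if ($type eq '-') { return { type => 'text',  text => $rest } }
    if ($type eq 'A' or $type eq '?') {
        # "Aname value" and "?target data" both split on the first space
        my ($name, $value) = split / /, $rest, 2;
        $value = '' unless defined $value;
        my $kind = $type eq 'A' ? 'attr' : 'pi';
        return { type => $kind, name => $name, value => $value };
    }
    return;                                     # unknown line type
}

# Sample PYX lines (made up for this example)
for my $line ('(p', 'Aclass intro', '-Hello, world', ')p') {
    my $ev = parse_pyx_line($line) or next;
    print join(' ', map { "$_=$ev->{$_}" } sort keys %$ev), "\n";
}
```

Because every line is self-describing, the same dispatch also works from a pipeline, for instance with grep or a Perl one-liner that only looks at lines starting with "A".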

Now, I know the module XML::PYX exists, and it even comes with a script called pyxhtml, which does pretty much what this code does.
But XML::PYX by itself isn't very flexible if you want finer control over what is kept from the HTML file and what is dropped.

Hopefully, this code can be easily customized to suit your needs, provided you know how to use HTML::Parser (which is really fun to use, especially version 3).

And the really cool thing is that your HTML doesn't have to be a valid XML file! (I wouldn't try to feed it Word 2000 pseudo-HTML though...)

More info on PYX

#!/usr/bin/perl -w
use strict;
use HTML::Parser ();

# See PYX format description
# http://www.xml.com/pub/a/2000/03/15/feature/index.html

my $parser = HTML::Parser->new(
        xml_mode        => 1,
        unbroken_text   => 1,
        ignore_elements => ['style', 'script'], # CDATA isn't supported
        start_h => [
                sub {
                        my ($tag, $attr) = @_;
                        print "($tag\n";
                        # PYX attribute lines have the form "Aname value"
                        print "A$_ $attr->{$_}\n" foreach keys %{$attr};
                }, "tagname, attr"],
        end_h   => [
                sub {
                        print ")" . shift() . "\n";
                }, "tagname"],
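        text_h  => [
                sub {
                        my $text = shift;
                        # trim leading and trailing whitespace
                        $text =~ s/^\s+|\s+$//g;
                        # skip text that was whitespace only
                        return unless length $text;
                        # PYX is line-based: escape embedded newlines
                        $text =~ s/\n/\\n/g;
                        print "-$text\n";
                }, "dtext"],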
);

die "usage: $0 file1.html > file1.pyx\n" unless @ARGV;

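foreach my $file (@ARGV){
        # parse_file() returns undef if the file cannot be opened
        $parser->parse_file($file) or die "Can't open $file: $!\n";
        $parser->eof();
}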
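
As the description says, the point of doing this with HTML::Parser is that the handlers can be adapted. Here is one possible variation, only a sketch: it keeps just the elements you name and drops everything else, including text outside them. The sub html_to_pyx, its @keep argument and the sample HTML are hypothetical names made up for this example. Attributes are written as "Aname value", and the lines are collected in an array instead of being printed directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not part of the script above): convert an HTML
# string to PYX lines, keeping only the elements named in @keep.
sub html_to_pyx {
    my ($html, @keep) = @_;
    my %keep  = map { $_ => 1 } @keep;
    my $depth = 0;      # number of kept elements currently open
    my @lines;          # collected PYX output

    my $parser = HTML::Parser->new(
        xml_mode      => 1,
        unbroken_text => 1,
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                return unless $keep{$tag};
                $depth++;
                push @lines, "($tag";
                push @lines, "A$_ $attr->{$_}" for sort keys %$attr;
            }, "tagname, attr"],
        end_h => [
            sub {
                my ($tag) = @_;
                return unless $keep{$tag};
                $depth-- if $depth > 0;     # tolerate stray end tags
                push @lines, ")$tag";
            }, "tagname"],
        text_h => [
            sub {
                my ($text) = @_;
                return unless $depth;       # ignore text outside kept elements
                $text =~ s/^\s+|\s+$//g;
                push @lines, "-$text" if length $text;
            }, "dtext"],
    );
    $parser->parse($html);
    $parser->eof;
    return @lines;
}

# Keep only the links of a small, made-up fragment
print "$_\n" for html_to_pyx(
    '<p>See <a href="x.html">this page</a> for details.</p>', 'a');
```

The filtering is done in the handlers with a plain hash lookup, so changing what is kept only means changing the list passed in. Sorting the attribute names is a choice made here to get stable output, not something PYX requires.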
