by OeufMayo (Curate)
on Aug 31, 2001 at 03:13 UTC ( #109244=sourcecode )
Category: HTML Utility
Author/Contact Info Briac Pilpré -

Pyxie is an alternative way of representing XML data. The data is represented in a really simple, line-oriented way: one piece of information per line.
The nice thing about PYX is how easy it is to parse the information you get. On the other hand, a lot of features found in the XML format can't be represented in PYX (CDATA, entities, ...).
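
To show how little work parsing PYX takes, here is a minimal sketch of a line-by-line reader. The helper name `parse_pyx_line` and the sample lines are made up for illustration, and the sketch assumes the usual PYX conventions: the first character of each line names the event, attributes are written as `Aname value`, and newlines in text are written as a literal `\n`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn one PYX line into an array ref [type, payload...].
sub parse_pyx_line {
    my ($line) = @_;
    chomp $line;
    return unless length $line;

    my $type = substr $line, 0, 1;   # first character names the event
    my $rest = substr $line, 1;

    if ($type eq '(') { return [ 'start', $rest ]; }
    if ($type eq ')') { return [ 'end',   $rest ]; }
    if ($type eq '?') { return [ 'pi',    $rest ]; }
    if ($type eq 'A') {
        # Assumed layout: attribute name, one space, attribute value
        my ($name, $value) = split / /, $rest, 2;
        return [ 'attr', $name, defined $value ? $value : '' ];
    }
    if ($type eq '-') {
        # Simplified unescaping: only the literal \n sequence is handled
        $rest =~ s/\\n/\n/g;
        return [ 'text', $rest ];
    }
    return [ 'unknown', $line ];
}

# Sample PYX for: <a href="http://example.org/">home</a>
my @pyx = ('(a', 'Ahref http://example.org/', '-home', ')a');

my @events;
foreach my $line (@pyx) {
    my $event = parse_pyx_line($line);
    push @events, $event if $event;
}

print join(' ', @$_), "\n" foreach @events;
```

A real consumer would read the lines from STDIN instead of a hard-coded list, but the dispatch on the first character stays the same.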

Now, I know the XML::PYX module exists, and it even comes with a script called pyxhtml, which does pretty much what this code does.
But XML::PYX by itself isn't really flexible if you want finer control over what is kept from the HTML file and what is dropped.

Hopefully, this code can be easily customized to suit your needs, provided you know how to use HTML::Parser (which is really fun to use, especially version 3).
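
As an idea of what such a customization might look like, here is a small sketch that keeps only the links of a page. It is not part of the script itself: the sample HTML string and the `@pyx` array are made up for illustration, and the `Aname value` attribute layout is assumed from the PYX notation. It relies on the HTML::Parser version 3 `report_tags` option to restrict which tags reach the handlers.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

my @pyx;    # collected PYX lines (illustrative; the script prints directly)

my $parser = HTML::Parser->new(
    api_version => 3,
    report_tags => ['a'],    # only <a> start and end tags are reported
    start_h     => [
        sub {
            my ($tag, $attr) = @_;
            push @pyx, "($tag";
            # Keep the href attribute only, drop everything else
            push @pyx, "Ahref $attr->{href}" if defined $attr->{href};
        }, "tagname, attr"],
    end_h       => [
        sub {
            push @pyx, ")" . shift();
        }, "tagname"],
);

# Made-up input; parse_file() would be used for real files
$parser->parse('<p>See <a href="http://example.org/" class="x">this</a>.</p>');
$parser->eof;

print "$_\n" foreach @pyx;
```

Since no text handler is registered, the text inside the link is dropped as well. Adding a `text_h` entry would bring it back.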

And the really cool thing is that your HTML doesn't have to be a valid XML file! (I wouldn't try to feed it Word 2000 pseudo-HTML though...)

More info on PYX

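#!/usr/bin/perl -w
use strict;
use HTML::Parser ();

# See PYX format description

my $parser = HTML::Parser->new(
        xml_mode        => 1,
        unbroken_text   => 1,
        ignore_elements => ['style', 'script'], # CDATA isn't supported
        start_h => [
                sub {
                        my ($tag, $attr) = @_;
                        print "($tag\n";
                        # PYX attribute lines have the form "Aname value"
                        print "A$_ $attr->{$_}\n" foreach keys %{$attr};
                }, "tagname, attr"],
        end_h   => [
                sub {
                        print ")" . shift() . "\n";
                }, "tagname"],
        text_h  => [
                sub {
                        my $text = shift;
                        $text =~ s/^\s+|\s+$//g;      # trim both ends
                        return unless length $text;   # skip whitespace-only text
                        $text =~ s/\n/\\n/g;          # keep one event per line
                        print "-$text\n";
                }, "dtext"],
);

die "usage: $0 file1.html > file1.pyx\n" unless @ARGV;

foreach (@ARGV){
        # Assumed loop body: parse each file named on the command line
        $parser->parse_file($_) or warn "Can't parse $_: $!\n";
}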