PerlMonks
We all agree that Perl does a really good job when
it comes to text extraction, particularly with regular
expressions.
XML is based on text, so one might think that it
would be dead easy to take any XML input and have it
converted in the way one wants.
Unfortunately, that is wrong. If you think you'll be
able to parse an XML file with a homegrown parser
you wrote overnight, think again, and look at the XML specs
closely. They are as complex as the CGI specs, and you
don't want to waste precious time on something
that will surely end up wrong anyway. Most of the background
discussions on why you should use CGI.pm instead of
your own CGI parser apply here.
The aim of this tutorial is not to show you how XML
should be structured and why you shouldn't parse it by hand,
but how to use the proper tool to do the right job.
I'll focus on the most basic XML module you can find,
XML::Parser. It's written by Larry Wall and Clark Cooper,
and I'm sure we can trust the former to make good
software (rn and patch are his most famous programs).
Okay, enough talk, let's jump into the module!
This tutorial will only show you the basics of XML
parsing, using the easiest (IMHO) methods. Please
refer to perldoc XML::Parser for more
detailed info.
I'm aware that there are a lot of XML tools available,
but knowing how to use XML::Parser can surely help
you a lot when you don't have any other module
to work with, and it also helped me to understand how
other XML modules worked, since most of them are built on
top of XML::Parser.
The example I'll use for this tutorial is the PerlMonks
Chatterbox ticker that some of you may have already used.
It looks like this:
<CHATTER>
  <INFO site="http://perlmonks.org" sitename="Perl Monks"> Rendered by the Chatterbox XML Ticker</INFO>
  <message author="OeufMayo" time="20010228112952"> test</message>
  <message author="deprecated" time="20010228113142"> pong</message>
  <message author="OeufMayo" time="20010228113153"> /me test again; :)</message>
  <message author="OeufMayo" time="20010228113255"> <a href="#">please note the use of HTML tags</a></message>
</CHATTER>
Thanks to deprecated for his unwitting contribution here.
(The astute reader will notice that a 'user_id' attribute has recently shown up in the CB ticker. Since it wasn't there when I took my 'snapshot' of the CB, I'll ignore it. Don't worry: the code below won't break at all, precisely because I used a proper parser to handle that for me!)
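To see why an extra attribute is harmless, here is a minimal sketch. The sample string and the user_id value are made up for illustration; the point is only that the Start handler receives all attributes as a hash and reads the ones it cares about.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample: one message without and one with a user_id attribute.
my $xml = <<'XML';
<CHATTER>
<message author="OeufMayo" time="20010228112952">test</message>
<message user_id="12345" author="deprecated" time="20010228113142">pong</message>
</CHATTER>
XML

my @authors;
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($p, $elt, %atts) = @_;
            return unless $elt eq 'message';
            # Unknown attributes simply sit in %atts; we only read what we need.
            push @authors, $atts{author};
        },
    },
);
$parser->parse($xml);

print join(', ', @authors), "\n";
```

A regex that expected author to be the first attribute would have missed the second message; the parser does not care about attribute order or count.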
Let's assume we want to output this file in a readable way (though it'll still be bare-bones). The script below doesn't handle links or internal HTML entities. It only fetches the CB ticker, parses it and prints it; you have to launch it again to follow the wise meditations and the brilliant rhetoric of the other fine monks present at the moment.
#!/usr/bin/perl -w
use strict;
use XML::Parser;
use LWP::Simple;    # used to fetch the chatterbox ticker

my $message;        # Hashref containing infos on a message

my $cb_ticker = get("http://perlmonks.org/index.pl?node=chatterbox+xml+ticker");
defined $cb_ticker or die "Could not fetch the chatterbox ticker\n";

my $parser = XML::Parser->new( Handlers => {    # Creates our parser object
    Start   => \&hdl_start,
    End     => \&hdl_end,
    Char    => \&hdl_char,
    Default => \&hdl_def,
});
$parser->parse($cb_ticker);

# The Handlers
sub hdl_start {
    my ($p, $elt, %atts) = @_;
    return unless $elt eq 'message';    # We're only interested in what's said
    $atts{'_str'} = '';
    $message = \%atts;
}

sub hdl_end {
    my ($p, $elt) = @_;
    format_message($message)
        if $elt eq 'message' && $message && $message->{'_str'} =~ /\S/;
}

sub hdl_char {
    my ($p, $str) = @_;
    return unless $message;    # Ignore text outside of <message> elements
    $message->{'_str'} .= $str;
}

sub hdl_def { }    # We just throw everything else away

sub format_message {    # Helper sub to nicely format what we got from the XML
    my $atts = shift;
    $atts->{'_str'} =~ s/\n//g;

    my ($y, $m, $d, $h, $n, $s) =
        $atts->{'time'} =~ m/^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/;

    # Handles the /me
    $atts->{'_str'} = $atts->{'_str'} =~ s/^\/me// ?
        "$atts->{'author'} $atts->{'_str'}" :
        "<$atts->{'author'}>: $atts->{'_str'}";
    $atts->{'_str'} = "$h:$n " . $atts->{'_str'};
    print "$atts->{'_str'}\n";
    undef $message;
}
Step-by-step code walkthrough:
The most interesting part, no doubt, is the creation of the XML::Parser object. The parser comes in different styles, but when you have to deal with simple data, like the CB ticker, the Handlers way is the easiest (see also the Subs style, which is really close to this one).
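For comparison, here is a rough sketch of the same idea in the Subs style. The sample string, the variable names and the output format are my own assumptions; the part that comes from the module is the naming convention (a sub named after the element for the opening tag, the same name plus an underscore for the closing tag) and the Pkg option.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my $xml = '<CHATTER><message author="OeufMayo" time="20010228112952"> test</message></CHATTER>';

my ($current, @lines);

# With Style => 'Subs', a sub named after the element is called when it opens...
sub message {
    my ($p, $elt, %atts) = @_;
    $current = { %atts, _str => '' };
}

# ...and the same name with a trailing underscore is called when it closes.
sub message_ {
    (my $text = $current->{_str}) =~ s/^\s+|\s+$//g;
    push @lines, "<$current->{author}> $text";
    undef $current;
}

my $parser = XML::Parser->new(
    Style    => 'Subs',
    Pkg      => 'main',    # package in which the element subs are looked up
    Handlers => {
        # The Subs style has no hook for text, so a Char handler is still needed.
        Char => sub { $current->{_str} .= $_[1] if $current },
    },
);
$parser->parse($xml);

print "$_\n" for @lines;
```

Elements without a matching sub (CHATTER here) are silently skipped, which is convenient but also means a typo in a sub name goes unnoticed.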
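For this object, we define four handler subs, each one corresponding to a different event in the parsing process: an opening tag (Start), a closing tag (End), character data (Char), and everything else (Default).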
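If you want to watch the handlers fire, a small tracing sketch like the following may help. The input string and the event labels are invented for the demonstration; it only records which handler was called and with what.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my @events;
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub { my ($p, $elt, %atts) = @_; push @events, "start $elt" },
        End   => sub { my ($p, $elt) = @_;        push @events, "end $elt" },
        Char  => sub {
            my ($p, $str) = @_;
            # Text may arrive in several chunks, so merge consecutive Char events.
            if (@events && $events[-1] =~ /^char /) { $events[-1] .= $str }
            else                                   { push @events, "char $str" }
        },
        # Anything without a dedicated handler (a comment, for instance) ends up here.
        Default => sub { my ($p, $str) = @_; push @events, "default $str" },
    },
);
$parser->parse('<message author="OeufMayo">pong<!-- aside --></message>');

print "$_\n" for @events;
```

The trace starts with the opening tag and ends with the closing tag, with the text and the comment reported in between by the Char and Default handlers.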
In hdl_start, we only want to deal with the <message> elements (those containing what is being said in the Chatterbox), so we'll happily skip every other element.
We get a hash with the attributes of the element, and we use that same hash to store the text to be displayed, under the '_str' key ($atts{'_str'}).
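Once the attributes and the text live in one hash, formatting a line needs no XML at all. The sketch below is a standalone variation on the formatting step, not the script's own sub: format_line is a hypothetical name, and trimming whitespace and falling back to ??:?? on a bad timestamp are my additions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# format_line is a hypothetical helper; it expects the attribute hash built by
# the Start handler, with the accumulated text stored under '_str'.
sub format_line {
    my ($atts) = @_;
    (my $text = $atts->{_str}) =~ s/^\s+|\s+$//g;

    # The ticker timestamp is YYYYMMDDhhmmss; fall back to ??:?? if it is not.
    my ($h, $n) = ($atts->{time} // '') =~ /^\d{8}(\d{2})(\d{2})\d{2}$/
        ? ($1, $2)
        : ('??', '??');

    # "/me waves" is shown as an action, anything else as a quoted line.
    return $text =~ s{^/me\s*}{}
        ? "$h:$n $atts->{author} $text"
        : "$h:$n <$atts->{author}>: $text";
}

print format_line({ author => 'OeufMayo',   time => '20010228113153', _str => ' /me test again; :)' }), "\n";
print format_line({ author => 'deprecated', time => '20010228113142', _str => ' pong' }), "\n";
```

Keeping this step free of parser objects makes it easy to test with hand-written hashes.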
We now have a complete and simple parser, ready to analyse, extract and report everything inside the Chatterbox XML ticker!
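If you would rather work from a saved snapshot than fetch the ticker each time, the parser can read a file directly with parsefile. This is a minimal sketch under a few assumptions of mine: the snapshot is written to a temporary file just so the example is self-contained, and the second input is deliberately broken to show that parsing dies on XML that is not well-formed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;
use File::Temp qw(tempfile);

# Write a tiny stand-in for a saved ticker snapshot to a temporary file.
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} '<CHATTER><message author="OeufMayo" time="20010228112952">test</message></CHATTER>';
close $fh or die "Cannot close $filename: $!";

my $count = 0;
my $parser = XML::Parser->new(
    Handlers => { Start => sub { $count++ if $_[1] eq 'message' } },
);

# parsefile() and parse() die on malformed XML, so trap the error with eval.
eval { $parser->parsefile($filename); 1 }
    or die "Could not parse $filename: $@";
print "Found $count message(s)\n";

# Input that is not well-formed (INFO is never closed) makes parse() die.
my $ok = eval { $parser->parse('<CHATTER><INFO>oops</CHATTER>'); 1 };
print "Malformed input rejected\n" unless $ok;
```

Note that handlers already called before the error was detected have run, so if that matters, collect results in a temporary structure and only use them once the parse has succeeded.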
That's all for now, here are some links you may find useful:
Thanks to mirod, arhuman and danger for the review!
Replies are listed 'Best First'.
Loading a Local File
by Sherlock (Deacon) on Apr 18, 2001 at 00:43 UTC
by Anonymous Monk on Jan 04, 2018 at 11:09 UTC
by Corion (Patriarch) on Jan 04, 2018 at 12:07 UTC
Re: XML::Parser Tutorial
by gildir (Pilgrim) on Mar 07, 2001 at 21:09 UTC
by mirod (Canon) on Mar 07, 2001 at 21:29 UTC
by merlyn (Sage) on Mar 07, 2001 at 21:11 UTC
by gildir (Pilgrim) on Mar 07, 2001 at 21:40 UTC
by Anonymous Monk on Mar 14, 2001 at 15:41 UTC
Re: XML::Parser Tutorial
by Jenda (Abbot) on Aug 21, 2008 at 19:14 UTC
by Mike Blume (Initiate) on Aug 22, 2008 at 18:49 UTC
by Jenda (Abbot) on Aug 22, 2008 at 19:58 UTC
by Mike Blume (Initiate) on Aug 23, 2008 at 20:23 UTC
by Jenda (Abbot) on Aug 24, 2008 at 12:41 UTC
Re: XML::Parser Tutorial
by Anonymous Monk on Sep 20, 2001 at 22:51 UTC
by ajt (Prior) on Sep 30, 2001 at 21:13 UTC
Re: XML::Parser Tutorial
by Mike Blume (Initiate) on Aug 21, 2008 at 18:10 UTC
by Anonymous Monk on Jun 01, 2012 at 16:27 UTC
Re: XML::Parser Tutorial
by Anonymous Monk on Apr 24, 2013 at 17:42 UTC
by runrig (Abbot) on Apr 24, 2013 at 18:30 UTC
by Anonymous Monk on Apr 24, 2013 at 20:01 UTC
That's not a simple parser
by wee (Scribe) on Jan 29, 2015 at 19:19 UTC
by toolic (Bishop) on Jan 29, 2015 at 19:33 UTC
by karlgoethebier (Abbot) on Feb 01, 2015 at 13:07 UTC
by Anonymous Monk on Feb 02, 2015 at 08:16 UTC
by karlgoethebier (Abbot) on Feb 02, 2015 at 08:39 UTC