XML::Parser Tutorial

We all agree that Perl does a really good job when it comes to text extraction, particularly with regular expressions.
XML is text-based, so one might think that it would be dead easy to take any XML input and convert it in whatever way one wants.
Unfortunately, that is wrong. If you think you'll be able to parse an XML file with a homegrown parser you wrote overnight, think again, and look closely at the XML specs. They're as complex as the CGI specs, and you don't want to waste precious time on something that will surely end up wrong anyway. Most of the background discussions on why you should use CGI.pm instead of your own CGI parser apply here.

The aim of this tutorial is not to show you how XML should be structured or why you shouldn't parse it by hand, but how to use the proper tool for the job.
I'll focus on the most basic XML module you can find, XML::Parser. It was written by Larry Wall and Clark Cooper, and I'm sure we can trust the former to make good software (rn and patch are his most famous programs).
Okay, enough talk, let's jump into the module!

This tutorial will only show you the basics of XML parsing, using the easiest (IMHO) methods. Please refer to perldoc XML::Parser for more detailed info.
I'm aware that there are a lot of XML tools available, but knowing how to use XML::Parser can help you a lot when you don't have any other module to work with. It also helped me understand how other XML modules work, since most of them are built on top of XML::Parser.
The example I'll use for this tutorial is the Perlmonks Chatterbox ticker that some of you may have already used. It looks like this:

<CHATTER><INFO site="http://perlmonks.org" sitename="Perl Monks">
Rendered by the Chatterbox XML Ticker</INFO>
	<message author="OeufMayo" time="20010228112952">
test</message>
	<message author="deprecated" time="20010228113142">
pong</message>
	<message author="OeufMayo" time="20010228113153">
/me test again; :)</message>
	<message author="OeufMayo" time="20010228113255">
&lt;a href="#"&gt;please note the use of HTML 
tags&lt;/a&gt;</message>
</CHATTER>

Thanks to deprecated for his unwitting intervention here.

( The astute reader will notice that in the CB ticker, a 'user_id' has shown up recently. Since it wasn't there when I took my 'snapshot' of the CB, I'll ignore it, but don't worry: the code below won't break at all, precisely because I used a proper parser to handle that for me! )

Let's assume we want to output this file in a readable way (though it'll still be barebones). The script below doesn't handle links or internal HTML entities. It only gets the CB ticker, parses it and prints it; you have to launch it again to follow the wise meditations and the brilliant rhetoric of the other fine monks present at the moment.

1  #!/usr/bin/perl -w
2  use strict;
3  use XML::Parser;
4  use LWP::Simple;  # used to fetch the chatterbox ticker
5  
6  my $message;      # Hashref containing infos on a message
7  
8  my $cb_ticker = get("http://perlmonks.org/index.pl?node=chatterbox+xml+ticker");
9  # we should really check if it succeeded or not
10  
11  my $parser = new XML::Parser ( Handlers => {   # Creates our parser object
12                              Start   => \&hdl_start,
13                              End     => \&hdl_end,
14                              Char    => \&hdl_char,
15                              Default => \&hdl_def,
16                            });
17  $parser->parse($cb_ticker);
18  
19  # The Handlers
20  sub hdl_start{
21      my ($p, $elt, %atts) = @_;
22      return unless $elt eq 'message';  # We're only interested in what's said
23      $atts{'_str'} = '';
24      $message = \%atts; 
25  }
26  
27  sub hdl_end{
28      my ($p, $elt) = @_;
29      format_message($message) if $elt eq 'message' && $message && $message->{'_str'} =~ /\S/;
30  }
31  
32  sub hdl_char {
33      my ($p, $str) = @_;
34      $message->{'_str'} .= $str;
35  }
36  
37  sub hdl_def { }  # We just throw everything else
38  
39  sub format_message { # Helper sub to nicely format what we got from the XML
40      my $atts = shift;
41      $atts->{'_str'} =~ s/\n//g;
42  
43      my ($y,$m,$d,$h,$n,$s) = $atts->{'time'} =~ m/^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/;
44  
45      # Handles the /me
46      $atts->{'_str'} = $atts->{'_str'} =~ s/^\/me// ?
47      "$atts->{'author'} $atts->{'_str'}"   :
48      "<$atts->{'author'}>: $atts->{'_str'}";
49      $atts->{'_str'} = "$h:$n " . $atts->{'_str'};
50      print "$atts->{'_str'}\n";
51      undef $message;
52  }

Step-by-step code walkthrough:

Lines 1 to 4

Initialisation of the basics needed for this snippet, XML::Parser, of course, and LWP::Simple to get the chatterbox ticker.

Line 8

LWP::Simple gets the requested URL and puts the content of the page in the $cb_ticker scalar. Note that get returns undef when the request fails, which this script does not check for.

Lines 11 to 16

The most interesting part, no doubt. Here we create a new XML::Parser object. The parser comes in different styles, but when you have to deal with simple data, like the CB ticker, the Handlers way is the easiest (see also the Subs style, which is really close to this one).

For this object, we define four handler subs, each covering a different kind of event in the parsing process.

The 'Start' handler is called whenever a new element (or tag, HTML-wise) is found. The sub is called with the expat object, the name of the element, and a list of attribute name/value pairs, which we collect into a hash.
The 'End' handler is called whenever an element is closed, and receives the same parameters as 'Start', minus the attributes.
The 'Char' handler is called when the parser finds something which is not mark-up (in our case, the text enclosed in the <message> tag).
Finally, the 'Default' handler is called, well, by default, for anything that none of the other three handlers covers.
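
To see the four handlers fire, here is a minimal sketch (not part of the ticker script) that records each call in an array. The tiny document and the @events name are made up for the illustration; the comment in the document is there only because nothing but the Default handler is registered to receive it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

my @events;    # records which handler fired, in order

my $parser = XML::Parser->new(
    Handlers => {
        Start   => sub { my ($p, $elt, %atts) = @_; push @events, "Start $elt" },
        End     => sub { my ($p, $elt) = @_;        push @events, "End $elt" },
        Char    => sub { my ($p, $str) = @_;        push @events, "Char" },
        Default => sub { my ($p, $str) = @_;        push @events, "Default" },
    },
);

# The comment is not covered by Start, End or Char, so it goes to Default.
$parser->parse('<a><!-- note --><b x="1">hi</b></a>');

print "$_\n" for @events;
```

The Start and End lines should nest the way the elements do, with the Default and Char entries in between.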

Line 17

The line that does all the magic, parsing and calling all your subs for you at the right moment.

Lines 20-25: the Start handler

We only want to deal with the <message> elements (those containing what is being said in the Chatterbox), so we'll happily skip every other element.

We get a hash with the attributes of the element, and we're going to use this same hash to store the text to be displayed, in $atts{'_str'}.

Lines 27-30: the End handler

Once we've reached the end of a message element, we format all the info we have gathered and print it via the format_message sub.

Lines 32-35: the Char handler

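This sub receives every piece of text the parser finds and appends it to the string that will finally be displayed. The parser may deliver a single run of text in several calls, which is why we append rather than assign.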

Line 37: the Default handler

It does nothing, so we don't have to figure out what to do with whatever ends up here!

Lines 39-52

This subroutine mangles all the info we got from the XML file, with bad regexes and all, and prints the formatted text in a hopefully readable way. Please note that XML::Parser handled all of the decoding of the &lt; and &gt; entities that were included in the original XML file.
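
The entity decoding is easy to check on its own. This is a small sketch, separate from the ticker script, that feeds one message (shortened from the sample above) to a parser with only a Char handler; $text is just a name picked for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        # Char may be called several times for one run of text, so append.
        Char => sub { my ($p, $str) = @_; $text .= $str },
    },
);

$parser->parse('<message author="OeufMayo">&lt;a href="#"&gt;tags&lt;/a&gt;</message>');

print "$text\n";    # prints: <a href="#">tags</a>
```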

We now have a complete and simple parser, ready to analyse, extract, report everything inside the Chatterbox XML ticker!

That's all for now, here are some links you may find useful:

Most of mirod's nodes (and especially his review of XML::Parser)
davorg's Data Munging with Perl

Thanks to mirod, arhuman and danger for the review!


Loading a Local File
by Sherlock (Deacon) on Apr 18, 2001 at 00:43 UTC

use IO::File;
...
my $fileStream = new IO::File ("yourDocumentName.xml");
...
my $parser = new XML::Parser 
   ( Handlers => 
      {   # Creates our parser object
         Start   => \&hdl_start,
         End     => \&hdl_end,
         Char    => \&hdl_char,
         Default => \&hdl_def,
      }
   );
...
$parser->parse($fileStream);

A Quick revision:

$parser->parsefile($filename)
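
For anyone who wants to try parsefile right away, here is a self-contained sketch. The temporary file stands in for your own document, and the one-message content is invented for the example.

```perl
use strict;
use warnings;
use XML::Parser;
use File::Temp qw(tempfile);

# Write a small document to a temporary file (a stand-in for your own file).
my ($fh, $filename) = tempfile(SUFFIX => '.xml', UNLINK => 1);
print {$fh} '<CHATTER><message author="OeufMayo">test</message></CHATTER>';
close $fh or die "Cannot close $filename: $!";

my @authors;
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($p, $elt, %atts) = @_;
            push @authors, $atts{author} if $elt eq 'message';
        },
    },
);

# parsefile opens and parses the file itself, and dies on errors.
$parser->parsefile($filename);

print "authors: @authors\n";
```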

Re: Loading a Local File

by Anonymous Monk on Jan 04, 2018 at 11:09 UTC

Is anyone facing the XML bomb issue with the code below, which was discussed above?

my $parser = new XML::Parser 
   ( Handlers => 
      {   # Creates our parser object
         Start   => \&hdl_start,
         End     => \&hdl_end,
         Char    => \&hdl_char,
         Default => \&hdl_def,
      }
   );
...
$parser->parse($fileStream);

Is the XML bomb issue applicable to the XML::Parser module? Can anyone shed some light on this?

-- Nagalakshmi


Re^2: Loading a Local File

by Corion (Patriarch) on Jan 04, 2018 at 12:07 UTC

What you call "xml bomb" is most likely the XML Entity Expansion attack.

This is most easily prevented by not expanding entities, or not expanding them recursively.

To enable that, see the XML::Parser documentation, especially the NoExpand flag and the handlers for external and other entities.

In those, you get to decide whether to fetch them and whether to expand them. If an entity expands to more entities, consider whether to expand them or not.
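
As a rough illustration of the NoExpand flag, here is a sketch that assumes the behaviour described in the XML::Parser documentation: with NoExpand set and a Default handler registered, references to internally defined entities are handed to the Default handler instead of being expanded. The document and the entity name e are made up, and this alone is not meant as a complete defence.

```perl
use strict;
use warnings;
use XML::Parser;

my $xml = <<'XML';
<!DOCTYPE d [
  <!ENTITY e "expanded text">
]>
<d>before &e; after</d>
XML

my $text = '';
my @unexpanded;

my $parser = XML::Parser->new(
    NoExpand => 1,    # only takes effect when a Default handler is set
    Handlers => {
        Char    => sub { my ($p, $str) = @_; $text .= $str },
        Default => sub {
            my ($p, $str) = @_;
            # Default also receives the DOCTYPE pieces; keep entity references only.
            push @unexpanded, $str if $str =~ /^&[^;]+;$/;
        },
    },
);
$parser->parse($xml);

print "text: $text\n";
print "left unexpanded: @unexpanded\n";
```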


Re: XML::Parser Tutorial
by gildir (Pilgrim) on Mar 07, 2001 at 21:09 UTC

subclassing

Is there a clean way to subclass XML::Parser?


Re: Re: XML::Parser Tutorial

by mirod (Canon) on Mar 07, 2001 at 21:29 UTC

The problem is probably that XML::Parser is an object factory: it generates XML::Parser::Expat objects with each parse or parsefile call. The handlers then receive XML::Parser::Expat objects and not XML::Parser objects.

There is a way to store data in the XML::Parser object and to access it in the handlers though: use the 'Non-Expat-Options' argument when creating the XML::Parser:

#!/bin/perl -w
use strict;
use XML::Parser;

my $p= new XML::Parser(
        'Non-Expat-Options' => { my_option => "toto" },
        Handlers => { Start => \&start, }
                      );
$p->parse( '<a />');

sub start
  { my( $pe, $elt, %atts)= @_;
    print "my option: ", $pe->{'Non-Expat-Options'}->{my_option}, "\n";
  }

This is certainly ugly but it works!

Update: note that the data is still stored in the XML::Parser object though, as shown by this code:

#!/bin/perl -w
use strict;
use XML::Parser;

my $p= new XML::Parser(
        'Non-Expat-Options' => { my_option => "1" },
        Handlers => { Start => \&start, }
                      );
$p->parse( '<a />');
$p->parse( '<b />');

sub start
  { my( $pe, $elt, %atts)= @_;
    print "element: $elt - my option: ", 
          $pe->{'Non-Expat-Options'}->{my_option}++, "\n";
    $p->parse( '<c />') 
       unless( $pe->{'Non-Expat-Options'}->{my_option} > 3);
  }

Which outputs:

element: a - my option: 1
element: c - my option: 2
element: c - my option: 3
element: b - my option: 4

Re: Re: XML::Parser Tutorial

by merlyn (Sage) on Mar 07, 2001 at 21:11 UTC

very

Just delegate the methods that you want to provide in your interface, and handle the rest. Make a hash with one of the elements being your "inherited" parser. I believe it's called the "wrapper" pattern, but I don't name my patterns; I just use them!

-- Randal L. Schwartz, Perl hacker


Re: Re: Re: XML::Parser Tutorial

by gildir (Pilgrim) on Mar 07, 2001 at 21:40 UTC

Suppose I do not subclass XML::Parser. Then how do I pass parameters to the XML::Parser handler methods and collect their results without using global variables in the XML::Parser package? The only object my handlers receive is the expat object itself, and it has no place for any additional parameters or results.

And if I subclass XML::Parser, the only advantage I gain is using my own package namespace for the global variables instead of XML::Parser's. That does not look to me like a good example of object-oriented programming style.

A possible solution is the one mirod suggested, using Non-Expat-Options, but it is only a little less ugly than these two.

The best solution would be forcing XML::Parser to use my custom subclass of XML::Parser::Expat instead of XML::Parser::Expat itself. Is there a way to do that?


Re: Re: Re: Re: XML::Parser Tutorial

by Anonymous Monk on Mar 14, 2001 at 15:41 UTC

Re: XML::Parser Tutorial
by Jenda (Abbot) on Aug 21, 2008 at 19:14 UTC

The first rule of XML::Parser's use: Don't. Or rather, don't use it directly. Unless you really must, which is much less often than you might think.

#!/usr/bin/perl -w
use strict;
use XML::Rules;

use LWP::Simple;  # used to fetch the chatterbox ticker

my $cb_ticker = get("http://perlmonks.org/index.pl?node=chatterbox+xml+ticker");

my $parser = XML::Rules->new(
    stripspaces => 7,
    rules => {
        message => sub {
            my ($tag, $atts) = @_;
            $atts->{'_content'} =~ s/\n//g;

            my ($y,$m,$d,$h,$n,$s) = $atts->{'time'} =~ m/^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/;

            # Handles the /me
            $atts->{'_content'} = $atts->{'_content'} =~ s/^\/me// ?
            "$atts->{'author'} $atts->{'_content'}"   :
            "<$atts->{'author'}>: $atts->{'_content'}";
            $atts->{'_content'} = "$h:$n " . $atts->{'_content'};
            print "$atts->{'_content'}\n";
            return;
        },
        'INFO,CHATTER' => 'pass',
    }
);

$parser->parse($cb_ticker);

Isn't this easier? Now imagine the <message> tag was not so simple; imagine it contained a structure of subtags and subsubtags. Your handlers would have to keep track of where in the structure the parser is and would have to build the data structure containing that data, so that they can finally access it in the end-tag handler if and only if the tag is <message>. Not what I would call convenient.

With XML::Rules you'd just specify which tags you want to include (and whether they are supposed to be repeated, contain text content etc. ... the rules may be inferred from a DTD or an example) and assign a handler specifically to the <message> tag. And the handler will have access to the data structure built from the subtags.

With XML::Twig you'll specify the twig_root (or something, I don't remember details) and again will assign a handler to the specific tag and receive all the data from the part of the XML enclosed in it.

And in neither case does the parser have to parse the whole file before your handlers are started and at no time is the whole parsed XML in the memory. (Well, if you use the modules correctly.)

Jenda
Support Denmark!
Defend the free world!


Re^2: XML::Parser Tutorial

by Mike Blume (Initiate) on Aug 22, 2008 at 18:49 UTC


Re^3: XML::Parser Tutorial

by Jenda (Abbot) on Aug 22, 2008 at 19:58 UTC

No, it was a response to the root node. For your problem ... show us your code. It's true that the Char handler will never be called, but both the Start and End handlers should. In either case you of course can use XML::Rules for that XML as well, specify a handler for the student tag and it will obtain all the attributes. Or, if you do not need to handle the individual <student> tags as you read them, specify student => 'as array', or possibly student => 'by name', in the rules. And handle the array of students in $attr->{student} or access the individual students as $attr->{$name} (depends on the rule you specify for <student>) in the handler for <class>.

Jenda
Support Denmark!
Defend the free world!


Re^4: XML::Parser Tutorial

by Mike Blume (Initiate) on Aug 23, 2008 at 20:23 UTC

Re^5: XML::Parser Tutorial

by Jenda (Abbot) on Aug 24, 2008 at 12:41 UTC

Re: XML::Parser Tutorial
by Anonymous Monk on Sep 20, 2001 at 22:51 UTC

When we use XML::Parser for parsing, how can we check that the XML is well-formed, i.e. that all the tags are closed and properly paired? Is there a method for that? Please let me know. Thanks, Prabu


Re: Re: XML::Parser Tutorial

by ajt (Prior) on Sep 30, 2001 at 21:13 UTC

So if you try to parse a file that isn't well-formed XML, you will discover this very quickly, as the parser will die. (This is incidentally a quick way of figuring out whether a file is XML.)

Davorg gives a good example here: Re: Is a file XML?. Basically you eval the parse call to trap the die, then do what you want afterwards.
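
A minimal sketch of that eval approach, not taken from the linked node: is_well_formed is a name invented for this example, and the two test strings are made up.

```perl
use strict;
use warnings;
use XML::Parser;

# Returns 1 if $xml is well-formed, 0 otherwise.
# (is_well_formed is a helper name made up for this sketch.)
sub is_well_formed {
    my ($xml) = @_;
    my $parser = XML::Parser->new;    # no handlers: we only care about errors
    return eval { $parser->parse($xml); 1 } ? 1 : 0;
}

print is_well_formed('<a><b/></a>') ? "ok\n" : "broken\n";    # prints: ok
print is_well_formed('<a><b></a>')  ? "ok\n" : "broken\n";    # prints: broken
```

When the parse fails, the parser's error message is left in $@, so you can log it or show it to the user.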


Re: XML::Parser Tutorial
by Mike Blume (Initiate) on Aug 21, 2008 at 18:10 UTC

   <class>
      <student name="Student1" race="Fast" age="Std" />
      <student name="Student2" race="Slow" age="New" />
      <student name="Student3" race="Okay" age="Old" />
   </class>

Re^2: XML::Parser Tutorial

by Anonymous Monk on Jun 01, 2012 at 16:27 UTC

search for parsing xml attributes. I'm trying to figure out a similar parsing issue. It appears that these are XML elements with no values, only attributes? I haven't been able to find much on this.
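
For what it's worth, elements that carry only attributes can be read entirely in the Start handler, since that is where the attributes arrive. A minimal sketch using the <class> document above (the @students array is a name chosen for the example):

```perl
use strict;
use warnings;
use XML::Parser;

my @students;    # one hashref of attributes per <student>

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($p, $elt, %atts) = @_;
            push @students, { %atts } if $elt eq 'student';
        },
    },
);

$parser->parse(<<'XML');
<class>
   <student name="Student1" race="Fast" age="Std" />
   <student name="Student2" race="Slow" age="New" />
   <student name="Student3" race="Okay" age="Old" />
</class>
XML

printf "%s: race=%s age=%s\n", @{$_}{qw(name race age)} for @students;
```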


Re: XML::Parser Tutorial
by Anonymous Monk on Apr 24, 2013 at 17:42 UTC

How can I select data that's nested but has the same tag name as data in another section of the XML? With the XML formed like this, the parser will get all information tagged with "message", rather than getting just the ones in "urgent" and/or "update". Should I just start another parse instance in the start handler?

<MAILBAG>
    <urgent>
        <message author="PatrickHenry" time="20010228212953">Now is the time for all good men to come to the aid of their party.</message>
    </urgent>
    <update>
        <message author="OeufMayo" time="20010228112952">test</message>
        <message author="deprecated" time="20010228113142">pong</message>
    </update>
</MAILBAG>

Re^2: XML::Parser Tutorial

by runrig (Abbot) on Apr 24, 2013 at 18:30 UTC

not use XML::Parser directly
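
If you do stay with plain XML::Parser, a second parser is not needed: the expat object passed to each handler knows which elements are currently open. The sketch below assumes the in_element method of XML::Parser::Expat, which inside a Start handler compares against the parent element; the shortened MAILBAG document and the @urgent and $current names are made up for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my @urgent;     # messages found directly under <urgent>
my $current;    # message being collected, if any

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($p, $elt, %atts) = @_;
            # Inside a Start handler the innermost open element is the parent.
            $current = { %atts, text => '' }
                if $elt eq 'message' && $p->in_element('urgent');
        },
        Char => sub {
            my ($p, $str) = @_;
            $current->{text} .= $str if $current;
        },
        End => sub {
            my ($p, $elt) = @_;
            return unless $elt eq 'message' && $current;
            push @urgent, $current;
            undef $current;
        },
    },
);

$parser->parse(<<'XML');
<MAILBAG>
    <urgent>
        <message author="PatrickHenry" time="20010228212953">Now is the time.</message>
    </urgent>
    <update>
        <message author="OeufMayo" time="20010228112952">test</message>
    </update>
</MAILBAG>
XML

print "$_->{author}: $_->{text}\n" for @urgent;
```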


Re^3: XML::Parser Tutorial

by Anonymous Monk on Apr 24, 2013 at 20:01 UTC

Yes perfect! I knew there had to be a better way. Thanks!


That's not a simple parser
by wee (Scribe) on Jan 29, 2015 at 19:19 UTC

I know this is a very old post, but in case anyone arrives here via Google, I wanted to note that this tutorial is misguided. There's absolutely no reason whatsoever to use XML::Parser, and the code above is needlessly complex. There's also about three times as much code as you need for something this simple. So why not use XML::Simple? It's pretty straightforward:

use warnings;
use strict;

use XML::Simple;

# Just an example. You'd use LWP to get the actual CB.
my $xml = '<CHATTER><INFO site="http://perlmonks.org" sitename="Perl Monks">Rendered by the Chatterbox XML Ticker</INFO>
    <message author="OeufMayo" time="20010228112952">test</message>
    <message author="deprecated" time="20010228113142">pong</message>
    <message author="OeufMayo" time="20010228113153">/me test again; :)</message>
    <message author="OeufMayo" time="20010228113255">&lt;a href="#"&gt;please note the use of HTML tags&lt;/a&gt;</message></CHATTER>';

my $ref = XMLin($xml);

foreach my $msg (@{$ref->{'message'}}) {
  my $h = substr($msg->{'time'}, 8, 2);
  my $n = substr($msg->{'time'}, 10, 2);

  my $author = $msg->{'content'} =~ s/^\/me// ? $msg->{'author'} : "<$msg->{'author'}>";
  print "$h:$n $author: $msg->{'content'}\n";
}

Re: That's not a simple parser (nor is XML:Simple)

by toolic (Bishop) on Jan 29, 2015 at 19:33 UTC

So why not use XML::Simple?

XML::Simple

The use of this module in new code is discouraged. Other modules are available which provide more straightforward and consistent interfaces. In particular, XML::LibXML is highly recommended.

The major problems with this module are the large number of options and the arbitrary ways in which these options interact - often with unexpected results.

Patches with bug fixes and documentation fixes are welcome, but new features are unlikely to be added.


Re: That's not a simple parser

by karlgoethebier (Abbot) on Feb 01, 2015 at 13:07 UTC

Please see also XML::LibXML::Simple for a handy replacement for XML::Simple: