PerlMonks  

XML::Twig parsing poorly structured content

by slugger415 (Monk)
on Jan 24, 2017 at 15:46 UTC ( [id://1180218]=perlquestion )

slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I'm using XML::Twig to parse an XML document, with dates and events, that is not as nicely structured as I'd like. Here's a pseudo-code example:

<div id="calendar">
  <h3 class="current-day">Wednesday, February 1</h3>
  <div class="event">Event 1</div>
  <div class="event">Event 2</div>
  <div class="event">Event 3</div>
  <h3 class="current-day">Thursday, February 2</h3>
  <div class="event">Event 1</div>
  <div class="event">Event 2</div>
  <h3 class="current-day">Friday, February 3</h3>
  <div class="event">Event 1</div>
  <div class="event">Event 2</div>
</div>

The problem is the div-events are not contained within the h3 elements, so I can't figure out how to associate the events with each date. I can get all the h3 children and all the div-event children with a div event handler at the top level:

my($twig, $div) = @_;
if($div->att('id') eq 'calendar'){
    my(@dates) = $div->children('h3');
    my(@events) = $div->children('div');
}

But obviously that just gives me two unconnected lists. Is there some clever way I can associate these elements, perhaps in the order they appear? Doesn't seem to be a "next_child" function in XML::Twig.

Thanks for any advice.

Replies are listed 'Best First'.
Re: XML::Twig parsing poorly structured content
by choroba (Cardinal) on Jan 24, 2017 at 16:38 UTC
    In the h3 handler, set a global header, and use it in the div handler.
    #!/usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };

    use XML::Twig;

    my $header;
    my $twig = 'XML::Twig'->new(
        twig_handlers => {
            # Remember the most recent date header.
            h3 => sub {
                $header = $_->text;
                $_->purge;
            },
            # Print each event next to the header that precedes it.
            'div[@class="event"]' => sub {
                say $header, "\t", $_->text;
                $_->purge;
            },
        },
    );
    $twig->parsefile('file.xml');

    In bigger projects, you don't want to have a global header. Instead, you can create a new class that has two attributes, header and twig, which delegates all the XML related work to the latter and stores the headers in the former.

    #!/usr/bin/perl
    {
        package XML::Twig::WithHeader;
        use feature qw{ say };
        use Moo;

        use XML::Twig;

        has _header => ( is => 'rw',   init_arg => undef );
        has _twig   => ( is => 'lazy', init_arg => undef );

        sub _build__twig {
            my ($self) = @_;
            my $twig = 'XML::Twig'->new(
                twig_handlers => {
                    h3 => sub {
                        $self->_header($_->text);
                        $_->purge;
                    },
                    'div[@class="event"]' => sub {
                        say $self->_header, "\t", $_->text;
                        $_->purge;
                    },
                },
            );
        }

        sub parse {
            my ($self, $file) = @_;
            $self->_twig->parsefile($file);
        }
    }

    my $twig = 'XML::Twig::WithHeader'->new;
    $twig->parse('file.xml');

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Looks very nice, thank you! And it works, at least for my sample XML.

      I'm not familiar with this handler construction:

      'div[@class="event"]'

      It looks rather XSL-ish. Is there some explanation of how that works? The reason I ask (sheepishly) is that my pseudo XML is simpler than the real stuff, meaning it has sub-levels that I want to parse, e.g.:

      <h3 class="current-day">Thursday, February 2</h3>
      <div class="event">
        <div class="title">Event 1</div>
        <span class="time">7:30pm</span>
        <span class="location">Main Street</span>
      </div>
      <div class="event">
        <div class="title">Event 2</div>
        <span class="time">9pm</span>
        <span class="location">Green Street</span>
      </div>

      Sorry not to be more detailed in my original post. Much appreciated.

        > rather XSL-ish

        It's called XPath. It's used and supported in a wider range of tools/languages/libraries than just XSL. This particular expression means "a div element whose class attribute has the value 'event'".
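
        As a rough illustration of that expression (a sketch only, not part of the code posted above): XML::Twig accepts the same XPath-like condition in its navigation methods as in handler triggers, so it can be tried out on a small in-memory document. The sample markup, including the "note" div, is made up for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use XML::Twig;

# Sample markup made up for this sketch: two divs match the
# condition, one (class="note") does not.
my $xml = <<'XML';
<div id="calendar">
  <h3 class="current-day">Wednesday, February 1</h3>
  <div class="event">Event 1</div>
  <div class="note">Not an event</div>
  <div class="event">Event 2</div>
</div>
XML

my $twig = 'XML::Twig'->new;
$twig->parse($xml);

# Same condition as the handler trigger, used for navigation instead:
# element name "div", then a predicate on the class attribute.
my @events = map { $_->text } $twig->root->children('div[@class="event"]');

say for @events;
```

        Only the two divs whose class is "event" should be listed; the "note" div and the h3 are filtered out by the condition.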

        > want to parse

        Then you can't use handlers, as you need access to more than just a subtree. The following shows how to do it. Using XML::LibXML would simplify the code in such a case, in my opinion.

        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };

        use XML::Twig;

        my $twig = 'XML::Twig'->new;
        $twig->parsefile(shift);
        my $root = $twig->root;

        for my $header ($root->descendants('h3')) {
            my $date = $header->text;
            # Keep only the following divs whose nearest preceding h3 is this header.
            my @events = $header->next_siblings(sub {
                my ($elt) = @_;
                'div' eq $elt->name && $elt->prev_sibling('h3') == $header
            });
            say join "\t", $date, map $_->text, $_->children for @events;
        }
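
        For comparison, here is one way the XML::LibXML variant mentioned above might look. It is a minimal sketch, assuming every h3 and every event div are direct children of the calendar div; the $seen counter, the @rows array and the Friday entry in the sample data are made up for the illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use XML::LibXML;

# Sample markup modelled on the nested example in this thread;
# the Friday entry is made up for the sketch.
my $xml = <<'XML';
<div id="calendar">
  <h3 class="current-day">Thursday, February 2</h3>
  <div class="event">
    <div class="title">Event 1</div>
    <span class="time">7:30pm</span>
    <span class="location">Main Street</span>
  </div>
  <div class="event">
    <div class="title">Event 2</div>
    <span class="time">9pm</span>
    <span class="location">Green Street</span>
  </div>
  <h3 class="current-day">Friday, February 3</h3>
  <div class="event">
    <div class="title">Event 3</div>
    <span class="time">8pm</span>
    <span class="location">High Street</span>
  </div>
</div>
XML

my $doc = 'XML::LibXML'->load_xml(string => $xml);

my @rows;
my $seen = 0;    # number of h3 headers visited so far, this one included
for my $header ($doc->findnodes('/div[@id="calendar"]/h3')) {
    ++$seen;
    my $date = $header->textContent;

    # An event belongs to this header when exactly $seen h3 elements
    # precede it among its siblings.
    my $xpath = 'following-sibling::div[@class="event"]'
              . "[count(preceding-sibling::h3) = $seen]";
    for my $event ($header->findnodes($xpath)) {
        push @rows, join "\t", $date,
            map { $event->findvalue($_) }
                'div[@class="title"]',
                'span[@class="time"]',
                'span[@class="location"]';
    }
}

say for @rows;
```

        Each output line is tab-separated: date, title, time, location. Counting preceding h3 siblings stands in for the node-identity comparison used in the XML::Twig version, since XPath 1.0 has no direct "same node" test.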

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        "I'm not familiar with this handler construction: 'div[@class="event"]'"

        Here's the current W3C Recommendation: "XML Path Language (XPath) 2.0 (Second Edition)".

        In almost all cases, I find the "3.2.4 Abbreviated Syntax" section adequate for my needs. This has a description of 'div[@class="event"]' (as para[@type="warning"]); and lots more besides.

        — Ken
