PerlMonks  

Help with Parsing XML output

by khrome (Initiate)
on Aug 12, 2014 at 03:02 UTC (#1097061=perlquestion)

khrome has asked for the wisdom of the Perl Monks concerning the following question:

Brief synopsis: I'm pushing a bunch of unstructured data into an API call that returns an XML page. Here's a snippet:

    <?xml version="1.0" encoding="UTF-8"?>
    <results>
        <url>--removed--</url>
        <language>english</language>
        <text>---removed---</text>
        <taxonomy>
            <category>
                <label>/vehicle brands/jeep</label>
            </category>
            <category>
                <label>/travel</label>
            </category>
        </taxonomy>
        <keywords>
            <keyword>
                <text>rear extended bumpstops</text>
            </keyword>
        </keywords>
    </results>

So far, I've been using XML::Simple to strip the header, and I've been trying to parse the data as such:

    my $xml = new XML::Simple (KeyAttr=>[]);
    my $TopList = $xml->XMLin($result);
    $SQL = "INSERT INTO categories (category, url) VALUES(?, ?)";
    $SQLX = $dbh->prepare($SQL);
    if ($TopList->{taxonomy}) {
        foreach my $cat (@{$TopList->{taxonomy}->{category}}) {
            $SQLX->execute($cat->{label}, $db_url);
        }
        $SQLX->finish();
    }

The difficulty is that the XML output is unpredictable. I may not have ANY <category> entries, based on the data. And sometimes, the output comes at me with a 2-word 'keyword' formation, or a hyphenated value. So essentially, I have TWO problems / questions:

1) I need to make a valid check to see if there is actually an entry for the subheading I'm looking for, such as the line above:

if ($TopList->{taxonomy})

This line has never barfed at me, but the next line has, so I need to know if I'm doing this check correctly. Reading part 2 might give you a bit more context for this part of my question...

2) The next line in the code barfs at me quite a bit, where I dereference to drill down:

foreach my $cat (@{$TopList->{taxonomy}->{category}})

Sometimes it barks that it's not a hash, so I change it, and then it barks that it's not an array. During a chat earlier, I was told this is a common problem with XML::Simple. It was suggested that I use XML::Twig, or ForceArray. I originally was asking if I could simply use an else clause with my foreach statement. Something like:

    foreach $var (@{$TopList->{taxonomy}->{category}}) {
        do something;
    }
    else foreach $var ($TopList->{taxonomy}->{category}) {
        do the something this way;
    }

The more I think about that else clause, the less it makes sense, but the more I think it would be a quick fix to my problem (thus, further cementing the idea that it won't work). Either way, I haven't gotten the syntax to work yet, so I thought I would ask:

How would the Perl Monks do this?

I should also add, that I'm on a very tight deadline, so quick and simple is better than complicated but elegant.

Help me, Perl Monks, You're My Only Hope!

-Khrome

Replies are listed 'Best First'.
Re: Help with Parsing XML output
by NetWallah (Canon) on Aug 12, 2014 at 03:55 UTC
    Since you are pressed for time, I'd suggest changing your XMLin to:
    my $xml = new XML::Simple (KeyAttr=>[], ForceArray => [ 'category' ],);
    And keeping your "foreach my $cat(.." intact.
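
    As a rough illustration of that suggestion, the sketch below parses a response that contains only one <category>, which is the case that otherwise collapses into a plain hash. The sample XML is cut down from the snippet in the question, and the `|| []` fallback for a response with no <category> at all is an added assumption, not something the original code had.

```perl
use strict;
use warnings;
use XML::Simple;

# Cut-down sample response with a single <category> entry.
my $result = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<results>
  <url>--removed--</url>
  <taxonomy>
    <category>
      <label>/travel</label>
    </category>
  </taxonomy>
</results>
XML

my $xs      = XML::Simple->new(KeyAttr => [], ForceArray => ['category']);
my $TopList = $xs->XMLin($result);

# With ForceArray, {category} is an array ref even for one entry.
# The "|| []" fallback (an assumption added here) covers a response
# whose <taxonomy> is missing or has no <category> children.
my $taxonomy   = $TopList->{taxonomy};
my $categories = ref($taxonomy) eq 'HASH' ? $taxonomy->{category} || [] : [];

my @labels = map { $_->{label} } @$categories;
print "Category: $_\n" for @labels;
```

    In the original code the `print` would be the `$SQLX->execute(...)` call; the loop body does not need to know how many categories came back.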

            Profanity is the one language all programmers know best.

Re: Help with Parsing XML output
by Anonymous Monk on Aug 12, 2014 at 03:36 UTC
Re: Help with Parsing XML output
by DrHyde (Prior) on Aug 12, 2014 at 10:09 UTC

    It sounds like you don't really want to parse the XML, you want to extract particular bits of data from it. These are subtly different tasks. IME the best tool for extracting bits of information from XML in Perl is XML::XPath.
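
    A minimal sketch of what that could look like, assuming the response has the same shape as the snippet in the question. The XPath expressions are guesses based on that snippet, not something taken from the API's documentation.

```perl
use strict;
use warnings;
use XML::XPath;

# Sample response copied from the question.
my $result = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<results>
  <url>--removed--</url>
  <language>english</language>
  <text>---removed---</text>
  <taxonomy>
    <category><label>/vehicle brands/jeep</label></category>
    <category><label>/travel</label></category>
  </taxonomy>
  <keywords>
    <keyword><text>rear extended bumpstops</text></keyword>
  </keywords>
</results>
XML

my $xp = XML::XPath->new(xml => $result);

# In list context findnodes returns a (possibly empty) list of nodes,
# so zero, one, or many matches all go through the same code path.
my @labels   = map { $_->string_value }
               $xp->findnodes('/results/taxonomy/category/label');
my @keywords = map { $_->string_value }
               $xp->findnodes('/results/keywords/keyword/text');

print "Category: $_\n" for @labels;
print "Keyword: $_\n"  for @keywords;
```

    Because the result is always a list, there is no hash-versus-array question to answer before looping.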

Re: Help with Parsing XML output
by jellisii2 (Hermit) on Aug 12, 2014 at 12:45 UTC
    Obligatory "use XML::Twig" post from me.

    For a quick and dirty fix for your problem, you could try

    if (ref($var) eq 'ARRAY') {
        ...
    }
    elsif (ref($var) eq 'HASH') {
        ...
    }
    else {
        ...
        die "Var is of " . ref($var) . " type and will not be processed";
    }
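
    One way to fold that check into the existing loop is a small helper that always hands back a list. This is only a sketch: `as_list` is a made-up name, not part of XML::Simple, and the three sample values stand in for the shapes `$TopList->{taxonomy}->{category}` can take.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Simple): returns a flat list
# whether it is handed undef, a single hash ref, or an array ref.
sub as_list {
    my ($value) = @_;
    return ()      unless defined $value;
    return @$value if ref($value) eq 'ARRAY';
    return ($value);
}

# The three shapes {category} can have: absent, one entry, many entries.
my @shapes = (
    undef,
    { label => '/travel' },
    [ { label => '/vehicle brands/jeep' }, { label => '/travel' } ],
);

my @counts;
for my $category (@shapes) {
    my @labels = map { $_->{label} } as_list($category);
    push @counts, scalar @labels;
    print scalar(@labels), " label(s): @labels\n";
}
```

    The original loop would then read `foreach my $cat (as_list($TopList->{taxonomy}->{category}))`, still inside the existing `if ($TopList->{taxonomy})` check.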
Re: Help with Parsing XML output
by Laurent_R (Canon) on Aug 12, 2014 at 07:29 UTC
    I do not have enough experience dealing with XML to help you on the main subject, but clearly an else clause does not make sense in a foreach statement, since a foreach statement is supposed to scan all the elements of the source list. So either you don't understand foreach, or it is not what you need to start with.
      Hmm, could the {do something;} be an if( something )... before the else? I think so :)
        Hmm, not entirely clear, because the syntax is broken anyway, but it seems to me that the {do something} is the block controlled by the first foreach.
Re: Help with Parsing XML output
by runrig (Abbot) on Aug 12, 2014 at 17:06 UTC
    Here's an XML::Rules solution, though I'm not sure if there may or may not be more than one label per category, or more than one keyword per keywords. I'm assuming "may not", but easy enough to fix if "may" (I would add 'as array no content' rules for label and keyword, and then assume arrays in the parent node's code):
    use strict;
    use warnings;
    use XML::Rules;

    my $xml = <<XML;
    <?xml version="1.0" encoding="UTF-8"?>
    <results>
        <url>--removed--</url>
        <language>english</language>
        <text>---removed---</text>
        <taxonomy>
            <category>
                <label>/vehicle brands/jeep</label>
            </category>
            <category>
                <label>/travel</label>
            </category>
        </taxonomy>
        <keywords>
            <keyword>
                <text>rear extended bumpstops</text>
            </keyword>
        </keywords>
    </results>
    XML

    my @rules = (
        category => sub {
            my $r = $_[1];
            print "Category: $r->{label}\n";
        },
        keywords => sub {
            my $r = $_[1];
            print "Keywords: $r->{text}\n";
        },
        keyword  => 'pass no content',
        _default => 'content',
    );

    my $xr = XML::Rules->new( rules => \@rules );
    $xr->parse($xml);
