PerlMonks  

XML & data structure parsing fun (XML::Simple ??)

by kabeldag (Hermit)
on Jun 05, 2008 at 06:35 UTC ( [id://690337]=perlquestion )

kabeldag has asked for the wisdom of the Perl Monks concerning the following question:

I have never had the need/requirement/want to deal with any XML before. At least not in any major way. However, I do now, and have a few questions.

Firstly, let me give the basic scenario:

The XML in question could be anything from semi-well-formed to well-formed. Secondly, let's assume that the XML elements from the root to a maximum depth of 4 are known, and that we are erring on the side of caution, in that the resulting XML::Simple structure may be a large mix of hashes and arrays at different depths.

Thirdly, some of the element values will vary in size (but the total size of the XML tree itself will usually not exceed 2MB), and there will be multiple sub-element containers of the same name. I am using XMLin() without any major options that would change the resulting structure.

The script I have written below works well enough with the test XML in the XML_RAW heredoc. But, before I start going too far, what suggestions does anyone have? Show me some other methods for getting 'concise' data from an XML tree :-)

use strict;
use warnings;
use XML::Simple;

my $xml_raw = <<XML_RAW;
<survey>
  <animals srcurl="blah.whatever.blah" method="ftp">
    <fish name="barramundi" freshwater="yes" saltwater="yes">
      <river>Todd</river>
      <river>Katherine</river>
    </fish>
    <fish name="carp" freshwater="yes" saltwater="no">
      <river>Tilbuster Ponds</river>
      <river>Maribyrnong</river>
      <river>Patterson</river>
      <river>Paterson</river>
      <river>Glenelg</river>
      <river>Murray</river>
      <river>Bunyip</river>
      <river>Campaspe</river>
    </fish>
    <fish name="yellowfin" freshwater="yes" saltwater="no">
      <river>Eucumbene</river>
      <river>Mulla Mulla Creek</river>
      <river>Burrungubugge</river>
      <river>Goobarragandra</river>
      <river>Bombala</river>
      <river>Murray</river>
      <river>Emu Swamp Creek</river>
    </fish>
  </animals>
</survey>
XML_RAW

my $xml_hash_ref = XMLin($xml_raw, KeepRoot=>1);
my %xml_hash = %{$xml_hash_ref};
my ($tl_hk, $tl_hv) = each %xml_hash;

my $last_key = '';
my @key_stash = ();
my $ref_type = '';
my $fish_species = '';
my $fish_survey_dump = "";

# Just to show you how XML::Simple has structured the XML into a hash
#use Data::Dumper;
#print Dumper(\%xml_hash);

traverse_hash($xml_hash{$tl_hk}, $tl_hk);

# Print out the fish survey information that we wanted.
# I concatenated it into a scalar just for quick display purposes
print "\n\n$fish_survey_dump\n";

sub traverse_hash {
    my ($hash_val, $last_key) = @_;
    push(@key_stash, "$last_key ->");
    for my $key (keys %{$hash_val}) {
        $ref_type = ref($hash_val->{$key}) || "VALUE";
        print "$ref_type: @key_stash $key -> ", $hash_val->{$key}, "\n";
        if($ref_type eq 'HASH') {
            if($key=~/barramundi|carp|yellowfin/) {
                $fish_species = $key;
                concat("\n\n[ Survey information for: $fish_species ]:\n\n");
                concat("Saltwater:" . $hash_val->{$fish_species}{'saltwater'} . "\n");
                concat("Freshwater:" . $hash_val->{$fish_species}{'freshwater'} . "\n");
                concat("Rivers covered in survey:\n\n");
                for my $river (@{$hash_val->{$fish_species}->{'river'}}) {
                    concat("$river\n");
                }
            }
            $last_key = $key;
            # Loop through any sub hashes by calling traverse_hash() again.
            traverse_hash($hash_val->{$key}, $last_key);
            pop(@key_stash);
        }elsif($ref_type eq 'ARRAY') {
            # Array reference
            traverse_array($key, @{$hash_val->{$key}});
        }else{
            # Hash value;
            # ...
        }
    }
}

sub traverse_array {
    my ($key, @array) = @_;
    for my $array_val (@array) {
        print "ARRAY-VAL: @key_stash $key -> ", $array_val, "\n";
        if(ref($array_val) eq 'HASH') {
            # Pass the array's key (not undef) so the path prints cleanly,
            # and pop the entry that traverse_hash() pushed.
            traverse_hash($array_val, $key);
            pop(@key_stash);
        }
    }
}

sub concat {
    my $string = $_[0];
    $fish_survey_dump .= $string;
}

The script above gives the following:

HASH: survey -> animals -> HASH(0x1ad678c)
VALUE: survey -> animals -> srcurl -> blah.whatever.blah
VALUE: survey -> animals -> method -> ftp
HASH: survey -> animals -> fish -> HASH(0x1b4262c)
HASH: survey -> animals -> fish -> carp -> HASH(0x1b425f0)
ARRAY: survey -> animals -> fish -> carp -> river -> ARRAY(0x1b42728)
ARRAY-VAL: survey -> animals -> fish -> carp -> river -> Tilbuster Ponds
ARRAY-VAL: survey -> animals -> fish -> carp -> river -> Maribyrnong
ARRAY-VAL: survey -> animals -> fish -> carp -> river -> Patterson
ARRAY-VAL: survey -> animals -> fish -> carp -> river -> Paterson
ARRAY-VAL: survey -> animals -> fish -> carp -> river -> Glenelg
ARRAY-VAL: survey -> animals -> fish -> carp -> river -> Murray
ARRAY-VAL: survey -> animals -> fish -> carp -> river -> Bunyip
ARRAY-VAL: survey -> animals -> fish -> carp -> river -> Campaspe
VALUE: survey -> animals -> fish -> carp -> saltwater -> no
VALUE: survey -> animals -> fish -> carp -> freshwater -> yes
HASH: survey -> animals -> fish -> barramundi -> HASH(0x1b425e4)
ARRAY: survey -> animals -> fish -> barramundi -> river -> ARRAY(0x1b4277c)
ARRAY-VAL: survey -> animals -> fish -> barramundi -> river -> Todd
ARRAY-VAL: survey -> animals -> fish -> barramundi -> river -> Katherine
VALUE: survey -> animals -> fish -> barramundi -> saltwater -> yes
VALUE: survey -> animals -> fish -> barramundi -> freshwater -> yes
HASH: survey -> animals -> fish -> yellowfin -> HASH(0x1b425fc)
ARRAY: survey -> animals -> fish -> yellowfin -> river -> ARRAY(0x1b4268c)
ARRAY-VAL: survey -> animals -> fish -> yellowfin -> river -> Eucumbene
ARRAY-VAL: survey -> animals -> fish -> yellowfin -> river -> Mulla Mulla Creek
ARRAY-VAL: survey -> animals -> fish -> yellowfin -> river -> Burrungubugge
ARRAY-VAL: survey -> animals -> fish -> yellowfin -> river -> Goobarragandra
ARRAY-VAL: survey -> animals -> fish -> yellowfin -> river -> Bombala
ARRAY-VAL: survey -> animals -> fish -> yellowfin -> river -> Murray
ARRAY-VAL: survey -> animals -> fish -> yellowfin -> river -> Emu Swamp Creek
VALUE: survey -> animals -> fish -> yellowfin -> saltwater -> no
VALUE: survey -> animals -> fish -> yellowfin -> freshwater -> yes


[ Survey information for: carp ]:

Saltwater:no
Freshwater:yes
Rivers covered in survey:

Tilbuster Ponds
Maribyrnong
Patterson
Paterson
Glenelg
Murray
Bunyip
Campaspe


[ Survey information for: barramundi ]:

Saltwater:yes
Freshwater:yes
Rivers covered in survey:

Todd
Katherine


[ Survey information for: yellowfin ]:

Saltwater:no
Freshwater:yes
Rivers covered in survey:

Eucumbene
Mulla Mulla Creek
Burrungubugge
Goobarragandra
Bombala
Murray
Emu Swamp Creek

Replies are listed 'Best First'.
Re: XML & data structure parsing fun (XML::Simple ??)
by GrandFather (Saint) on Jun 05, 2008 at 06:58 UTC

    You should consider XML::Twig, XML::TreeBuilder, XML::LibXML and others among the very many modules in CPAN's XML namespace.

    XML::Simple seems to generate as many questions here as any other single module and most of the answers recommend a different XML module. XML::Simple is simple for a very small range of tasks that are a good match for its default behavior. After that it tends to become XML::Warty.
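
    For a feel of what one of those alternatives looks like, here is a minimal sketch using XML::LibXML with XPath. It assumes XML::LibXML is installed from CPAN and works on a trimmed copy of the survey XML from the question. The helper name survey_report is made up for illustration; load_xml, findnodes, getAttribute and textContent are XML::LibXML methods.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed-down copy of the survey XML from the question.
my $xml_raw = <<'XML_RAW';
<survey>
  <animals srcurl="blah.whatever.blah" method="ftp">
    <fish name="barramundi" freshwater="yes" saltwater="yes">
      <river>Todd</river>
      <river>Katherine</river>
    </fish>
    <fish name="carp" freshwater="yes" saltwater="no">
      <river>Murray</river>
    </fish>
  </animals>
</survey>
XML_RAW

# survey_report() is a hypothetical helper, not part of XML::LibXML.
# It returns one hash per <fish>, with rivers always held in an array ref.
sub survey_report {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my @report;
    for my $fish ($doc->findnodes('/survey/animals/fish')) {
        push @report, {
            name       => $fish->getAttribute('name'),
            saltwater  => $fish->getAttribute('saltwater'),
            freshwater => $fish->getAttribute('freshwater'),
            rivers     => [ map { $_->textContent } $fish->findnodes('river') ],
        };
    }
    return \@report;
}

for my $fish (@{ survey_report($xml_raw) }) {
    print "[ Survey information for: $fish->{name} ]\n";
    print "Saltwater: $fish->{saltwater}\n";
    print "Freshwater: $fish->{freshwater}\n";
    print "Rivers covered in survey:\n";
    print "  $_\n" for @{ $fish->{rivers} };
    print "\n";
}
```

    Because findnodes returns a list whether there are zero, one or many matches, the single-river case needs no special handling here.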


    Perl is environmentally friendly - it saves trees
Re: XML & data structure parsing fun (XML::Simple ??)
by Jenda (Abbot) on Jun 05, 2008 at 09:00 UTC

    It's better to give XML::Simple a few hints so that it creates a data structure that's consistent and easy to work with.

    use strict;
    use warnings;
    use XML::Simple;

    my $xml_raw = <<XML_RAW;
    <survey>
    ...
    </survey>
    XML_RAW

    my $data = XMLin($xml_raw, ForceArray => [qw(river fish)], KeyAttr => []);

    foreach my $Animal (@{$data->{animals}{fish}}) {
        print <<"*END*";
    [ Survey information for: $Animal->{name} ]:

    Saltwater:$Animal->{saltwater}
    Freshwater:$Animal->{freshwater}
    Rivers covered in survey:

    *END*
        for (@{$Animal->{river}}) {
            print $_, "\n";
        }
        print "\n";
    }

    or you could use XML::Rules and print the report as the file is being parsed:
    use strict;
    use warnings;
    use XML::Rules;

    my $xml_raw = <<XML_RAW;
    <survey>
    ...
    </survey>
    XML_RAW

    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules => {
            _default => '',
            river => 'content array',
            fish => sub {
                print <<"*END*";
    [ Survey information for: $_[1]->{name} ]:

    Saltwater:$_[1]->{saltwater}
    Freshwater:$_[1]->{freshwater}
    Rivers covered in survey:

    *END*
                for (@{$_[1]->{river}}) {
                    print $_, "\n";
                }
                print "\n";
            },
        }
    );
    $parser->parse($xml_raw);

    or
    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules => {
            _default => '',
            river => sub {'.river' => "$_[1]->{_content}\n"},
            fish => sub {
                print <<"*END*";
    [ Survey information for: $_[1]->{name} ]:

    Saltwater:$_[1]->{saltwater}
    Freshwater:$_[1]->{freshwater}
    Rivers covered in survey:

    $_[1]->{river}
    *END*
            },
        }
    );
    $parser->parse($xml_raw);

Re: XML & data structure parsing fun (XML::Simple ??)
by Cody Pendant (Prior) on Jun 05, 2008 at 07:05 UTC
    Not very helpful, but I feel compelled to say that there's really no such thing as "XML which is not well formed". If it's not well formed, it's not XML you can successfully handle with an XML Parser.

    And you should probably consider using XML::XSLT because XSLT is the specific language created to handle XML. It's a language only a mother could love, but it's designed for the purpose. Not that perl isn't a perfectly good tool for this job, but XSLT is a tool designed to do nothing else but this job.



    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...
      I guess he means XML documents with a flexible structure, e.g. a value is missing, there is one value, or there are several values. I agree that XML::Simple is only good for simple stuff. In particular, the stability gets pretty bad when the structure is very flexible and not static.
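
      A minimal sketch of that "missing, one, or several" problem, in core Perl only. The data below is hand-written to stand in for what XMLin() might return with default options, where a lone child element typically comes back as a plain string and repeated ones as an array reference. The helper name as_list is made up for illustration.

```perl
use strict;
use warnings;

# as_list() is a hypothetical helper: it turns "missing", "one value" and
# "several values" into a plain list so the caller can always loop.
# Call it in list context.
sub as_list {
    my ($value) = @_;
    return ()        unless defined $value;        # element was absent
    return @{$value} if ref($value) eq 'ARRAY';    # element was repeated
    return ($value);                               # element occurred once
}

# Hand-written stand-ins for three parsed <fish> elements with zero, one
# and two <river> children.
my @fish = (
    { name => 'eel' },
    { name => 'carp',       river => 'Murray' },
    { name => 'barramundi', river => [ 'Todd', 'Katherine' ] },
);

for my $fish (@fish) {
    my @rivers = as_list($fish->{river});
    printf "%s: %d river(s): %s\n",
        $fish->{name}, scalar @rivers, join(', ', @rivers);
}
```

      Without a helper like this (or a parser option that forces arrays), code written against the two-river case breaks on the one-river case.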
Re: XML & data structure parsing fun (XML::Simple ??)
by kabeldag (Hermit) on Jun 05, 2008 at 09:54 UTC
    GrandFather, I agree. I'm not sure about Perl 5.10.x, but XML::Simple is stock on 5.8.8 (I'm fairly sure; better double-check that), which is why I chose it. That, and the fact that it is fairly lightweight and wasn't much fuss.

    "I agree that XML::Simple is only good for simple stuff. In particular, the stability gets pretty bad when the structure is very flexible and not static."

    It sure does smell that way, weismat. Jenda mentions giving XMLin() hints (the ForceArray and KeyAttr options), which wouldn't be so bad in this instance, maybe, but I'd prefer not to have to expect anything about the structure in advance.

      At one point or another you are going to need to know the structure of the XML you are going to work with anyway. The fact that DOM or some other maze-of-objects-based XML parser lets you get away with assuming a single occurrence of some tag even though it may be repeated, while XML::Simple would cause your script to bomb out, is not always a good thing. I'd rather my stuff failed noisily than produced incorrect results.
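
      One way to picture "failing noisily" is a small check that dies as soon as the parsed data is not shaped the way the code expects. This is only a sketch in core Perl with hand-written data; the helper name require_array is made up for illustration and is not part of any XML module.

```perl
use strict;
use warnings;
use Carp qw(croak);

# require_array() is a hypothetical helper: it returns the array reference
# stored under $key, or dies with a message naming what it found instead.
sub require_array {
    my ($hash, $key) = @_;
    croak "expected a HASH reference" unless ref($hash) eq 'HASH';
    croak "no '$key' element found"   unless exists $hash->{$key};
    my $type = ref($hash->{$key}) || 'plain value';
    croak "'$key' is a $type, expected an ARRAY" unless $type eq 'ARRAY';
    return $hash->{$key};
}

# Hand-written stand-ins for parsed data: one well-shaped, one not.
my $good = { name => 'barramundi', river => [ 'Todd', 'Katherine' ] };
my $bad  = { name => 'carp',       river => 'Murray' };

print "good: @{ require_array($good, 'river') }\n";

# eval traps the failure here only so that the sketch runs to the end.
my $ok = eval { require_array($bad, 'river'); 1 };
print "bad:  $@" unless $ok;
```

      In a real script you would normally let the croak propagate, so a structural surprise stops the run instead of silently dropping data.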

      You can of course split the XML processing into two completely unrelated phases. The first knows nothing whatsoever about the structure of the XML or what data you are after, and just blindly parses the file into some kind of data/object structure (done this way, it's not much more than lexing, actually). The second does know where the data you wanted are, and has to navigate the structure to get them. In my humble opinion this is often unnecessarily hard and inefficient.

Node Type: perlquestion [id://690337]
Approved by moritz