PerlMonks  

Parsing generic XML

by aknipp (Initiate)
on Jun 24, 2011 at 16:22 UTC ( [id://911271]=perlquestion )

aknipp has asked for the wisdom of the Perl Monks concerning the following question:

OK, I am writing a program to grab an XML file and parse it (and possibly update it), so I am using XML::Simple. My question is: how can I, in a generic way, reliably access the data whether it is a hash, plain data, or an array?

my $file1=shift;
use XML::Simple;
use Data::Dumper;
my $xml;
open (IN, "<$file1") || die "[r] '$file1' ($!)";
{
    local $/ = undef;
    $xml=<IN>;
}
close(IN);
my $ref = XMLin($xml);
foreach (keys %{$ref}) {
    print $_." ".$ref->{$_}." "."\n";
}

I have tried (and other permutations):

scalar @$ref->{$_}
(exists $ref->{$_}[0])
(defined $ref->{$_}[0])

Some just return the data; the ones with [0] break when used on a node that just returns the data.

Tips/Pointers/RTFMs appreciated.

Andy

Replies are listed 'Best First'.
Re: Parsing generic XML
by wind (Priest) on Jun 24, 2011 at 16:31 UTC

    If you're asking how to access the data structure returned by XML::Simple, that entirely depends on the structure of your XML. There is nothing wrong with the code that you've listed above, but if you're wanting to know why $ref->{$_} isn't listing deeper contents, maybe it's because it's a more complex data structure than just a scalar?

    If so, just use Data::Dumper to output it or use ref to determine what type it is and use recursion:

    foreach (keys %{$ref}) { print $_." ".Dumper($ref->{$_})."\n"; }
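
    A minimal sketch of the "use ref and recursion" idea, for what it's worth. It assumes a hand-built structure standing in for real XMLin output; the walk sub and the sample data are made up for illustration and are not part of XML::Simple:

```perl
use strict;
use warnings;

# Hypothetical structure, similar in shape to what XMLin might return.
my $ref = {
    title => 'Example',
    item  => [ 'one', 'two' ],
    owner => { name => 'Andy' },
};

# Recursively flatten a nested structure into "path = value" lines.
# ref() returns 'HASH', 'ARRAY', or '' for a plain scalar.
sub walk {
    my ($data, $path) = @_;
    my $type = ref $data;
    if ($type eq 'HASH') {
        return map { walk($data->{$_}, "$path/$_") } sort keys %$data;
    }
    elsif ($type eq 'ARRAY') {
        return map { walk($data->[$_], $path . "[$_]") } 0 .. $#$data;
    }
    # Anything else is treated as a leaf value.
    return "$path = " . (defined $data ? $data : '(undef)');
}

my @lines = walk($ref, '');
print "$_\n" for @lines;
```

    Because the sub checks the type of each value before touching it, it never tries [0] on a plain scalar, which is the failure described in the question.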

    Note that there are parameters you can pass to XMLin to adjust the way it builds the data structure from the source XML. See the XML::Simple documentation on CPAN for details.

      Thanks. I will try your code. The XML I plan to use is variable, so I was trying to write something rather generic. If I can identify a structure vs a node I should be OK.

Re: Parsing generic XML
by Khen1950fx (Canon) on Jun 24, 2011 at 18:13 UTC
    XML::Simple has a strict mode. You'll want to get in the habit of using it.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Simple qw(:strict);
    use Data::Dumper::Concise;

    my $file = shift @ARGV;
    open my $in, '<', $file or die "Cannot open '$file': $!";
    my $xml = do { local $/ = undef; <$in> };    # slurp the whole file
    close $in;

    my $ref = XMLin(
        $xml,
        KeyAttr    => { item => 'name' },
        ForceArray => [ 'item' ],
        ContentKey => '-content',
    );

    foreach (keys %{$ref}) {
        # Dump the value itself; concatenating it onto a string first
        # would stringify references as HASH(0x...) or ARRAY(0x...).
        print "$_ = ", Dumper($ref->{$_}), "\n";
    }
Re: Parsing generic XML
by ikegami (Patriarch) on Jun 24, 2011 at 17:25 UTC
    # Reference to array of foo elements.
    my $foos = (ref($node->{foo}) // '') eq 'ARRAY'
        ? $node->{foo}
        : [ $node->{foo} ];
    or
    XMLin($xml, ForceArray => 1);
    ...
    # Reference to array of foo elements.
    my $foos = $node->{foo};
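
    The first variant can be wrapped in a small helper so the check lives in one place. This is only a sketch: as_list is a hypothetical name, the nodes are hand-built stand-ins for XMLin output without ForceArray, and unlike the one-liner above it returns an empty list (rather than a one-element list holding undef) when the key is missing:

```perl
use strict;
use warnings;

# Hypothetical helper: always return a list, whether the value is
# missing, a single scalar/hash, or an array reference.
sub as_list {
    my ($value) = @_;
    return ()      unless defined $value;
    return @$value if ref($value) eq 'ARRAY';
    return ($value);
}

# Hand-built nodes standing in for XMLin output without ForceArray.
my $one  = { foo => 'a' };
my $many = { foo => [ 'a', 'b', 'c' ] };
my $none = {};

my @counts;
for my $node ($one, $many, $none) {
    my @foos = as_list($node->{foo});
    push @counts, scalar @foos;
}
print "@counts\n";    # prints "1 3 0"
```

    With ForceArray => 1 the helper is unnecessary, at the cost of every element becoming an array reference even when it only occurs once.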
Re: Parsing generic XML
by graff (Chancellor) on Jun 25, 2011 at 01:34 UTC
    Since (according to one of your replies above) the xml input is "variable", you might be interested in the following, which I wrote a while back just to be able to summarize xml tag structures in a generic way.

    I prefer "low level" xml modules like XML::Parser and XML::LibXML, because for some reason I find that they are actually easier for me to learn, compared to the "refined sugar" approaches like XML::Simple and XML::Twig; I don't mind writing a few extra lines of code, given that I'm able to understand more quickly what the code is really doing.

    As for going beyond simple summarization and updating content, I think LibXML would be the tool I'd prefer.

    #!/usr/bin/perl

    use strict;
    use XML::Parser;

    my $Usage = "$0 [-r] [-a] [-b] file.xml\n";
    my ( $add_root, $count_attribs, $discrete_count );
    while ( @ARGV > 1 and $ARGV[0] =~ /^-([abr])$/ ) {
        if ( $1 eq 'r' ) {
            $add_root = shift;
        }
        elsif ( $1 eq 'a' ) {
            $count_attribs = shift;
        }
        else {
            $discrete_count = shift;
        }
    }
    die $Usage unless ( @ARGV == 1 and -f $ARGV[0] );

    my %embedding;
    my $key = '';
    my %ehist;
    my %ahist;
    my $p = XML::Parser->new( Handlers => {
        Start => sub{ my $newkey = "$key/$_[1]";
                      if ( $key and $discrete_count
                           and !exists( $embedding{$key} )) {
                          $embedding{$key}++;
                          $ehist{$key}--;
                      }
                      $key = $newkey;
                      $ehist{$key}++;
                      if ( $count_attribs ) {
                          for ( my $i=2; $i<$#_; $i+=2 ) {
                              $ahist{$key}{$_[$i]}++;
                          }
                      }
                    },
        End   => sub{ delete $embedding{$key} if ( $discrete_count );
                      $key =~ s{/$_[1]$}{}
                    },
    } );
    if ( ! $add_root ) {
        $p->parsefile( $ARGV[0] );
    }
    else {
        my $xmlstr = "<STRUCT_HIST_ROOT_$$>\n";
        open( X, '<:utf8', $ARGV[0] ) or die "Unable to read $ARGV[0]: $!\n";
        { local $/ = undef; $xmlstr .= <X>; }
        close X;
        $xmlstr .= "</STRUCT_HIST_ROOT_$$>";
        $p->parse( $xmlstr );
    }
    for my $k ( sort keys %ehist ) {
        $_ = $k;
        if ( $add_root ) {
            s{/STRUCT_HIST_ROOT_$$}{};
            next unless /\S/;
        }
        next if ( $discrete_count and $ehist{$k} <= 0 );
        print "$ehist{$k}\t$_\n";
        if ( $count_attribs ) {
            print "\t$ahist{$k}{$_}\t\@$_\n" for ( sort keys %{$ahist{$k}} );
        }
    }

    =head1 NAME

    xml-structure-hist

    =head1 SYNOPSIS

    xml-structure-hist [-r] [-a] [-b] file.xml

      -r : have the program supply a root node tag
      -a : tabulate element attributes (only on raw element counts)
      -b : count only "bottom-level" paths (def: also count intermed.paths)

    =head1 DESCRIPTION

    For any given xml file, this tool will use a standard xml parser to
    tabulate the structure of the tags and print (on STDOUT) a tally of
    how many times each distinct structural element occurs in the file.

    Use the "-r" option if the input file does not include its own "root"
    xml tag (e.g. when multiple blocks of similar xml data are
    concatenated without a wrapper tag being put around them).

    For example, given an xml file with these contents:

     <root_node>
      <level1 id="x">
       <level2_a><level3 x="y">...</level3><level3>...</level3></level2_a>
       <level2_a><level3 x="z">...</level3><level3>...</level3></level2_a>
      </level1>
      <level1 id="y">
       <level2_a><level3 x="w"><level4>...</level4>...</level3></level2_a>
       <level2_b><level3 x="x">...</level3></level2_b>
      </level1>
      <level1 id="z">
       <level2_a>...</level2_a>
      </level1>
     </root_node>

    the default output would be:

      1    /root_node
      3    /root_node/level1
      4    /root_node/level1/level2_a
      5    /root_node/level1/level2_a/level3
      1    /root_node/level1/level2_a/level3/level4
      1    /root_node/level1/level2_b
      1    /root_node/level1/level2_b/level3

    With the "-a" option, the output would be:

      1    /root_node
      3    /root_node/level1
           3    @id
      4    /root_node/level1/level2_a
      5    /root_node/level1/level2_a/level3
           3    @x
      1    /root_node/level1/level2_a/level3/level4
      1    /root_node/level1/level2_b
      1    /root_node/level1/level2_b/level3
           1    @x

    With the "-b" option, the output would be:

      1    /root_node/level1/level2_a
      4    /root_node/level1/level2_a/level3
      1    /root_node/level1/level2_a/level3/level4
      1    /root_node/level1/level2_b/level3

    If the example lacked the "root_node" tags, you would use the "-r"
    option, and the quantities reported for the "level*" tags would be the
    same as above.

    =head1 AUTHOR

    David Graff <graff at ldc.upenn.edu>

    =cut
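
    On the remark that LibXML would be the preferred tool for updating content, a rough sketch of what that might look like follows. It assumes XML::LibXML is installed; the input document, the status attribute, and the name element are invented for illustration and do not come from the original poster's data:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input; element and attribute names are made up.
my $xml = <<'XML';
<root_node>
  <level1 id="x"><name>old</name></level1>
  <level1 id="y"><name>keep</name></level1>
</root_node>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Select nodes with XPath, then change them in place.
for my $node ($doc->findnodes('/root_node/level1[@id="x"]')) {
    $node->setAttribute(status => 'updated');
    my ($name) = $node->findnodes('./name');
    $name->removeChildNodes;      # drop the old text
    $name->appendText('new');     # add the replacement text
}

# Serialize the modified document back to a string.
my $out = $doc->toString;
print $out;
```

    Since XPath always returns a list of nodes, the "is it a hash or an array?" question from the original post does not come up with this approach.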
Re: Parsing generic XML
by grantm (Parson) on Jun 25, 2011 at 00:26 UTC
    This post will give you more info on ways to use XML::Simple and how to achieve the same things with XML::LibXML.
Re: Parsing generic XML
by sundialsvc4 (Abbot) on Jun 25, 2011 at 00:50 UTC

    Casting a recommendation here for XML::Twig. I have consistently found that, where XML is concerned (and especially if the files are really big, or might become so), the “big guns” are the best.
