PerlMonks  

XML::Twig help please

by Binford (Sexton)
on Oct 14, 2009 at 15:08 UTC

Binford has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am trying to parse a VERY large XML file of the form:
<?xml version="1.0" encoding="UTF-8"?>
<authenticationReports>
  <generatedTime>Tue Sep 29 07:07:34 PDT 2009</generatedTime>
  <appDeploymentFile name="app-deployment.properties.hklcp.trading">
    <application name="hk">
      <urlInfo><url>a/b/hk/accts_subscription</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
      <urlInfo><url>a/b/hk/accts_forms</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
      <urlInfo><url>a/b/hk/custtradingpage</url><otherPrereq>BasicPrereq</otherPrereq></urlInfo>
      <urlInfo><url>a/b/hk/accts_userinfo</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
      <urlInfo><url>a/b/hk/headermain</url></urlInfo>
      <urlInfo><url>a/b/hk/custservicepage</url></urlInfo>
      <urlInfo><url>a/b/hk/accts_transfermoney</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
      <urlInfo><url>a/b/hk/userprereq</url></urlInfo>
      <urlInfo><url>a/b/hk/indices_us</url></urlInfo>
      <urlInfo><url>a/b/hk/homeloggedmessage</url></urlInfo>
      <urlInfo><url>a/b/hk/lead</url></urlInfo>
      <urlInfo><url>a/b/hk/orderviewmin</url></urlInfo>
      <urlInfo><url>a/b/hk/accts_changelogin</url><otherPrereq>SessionPreReq</otherPrereq></urlInfo>
    </application>
  </appDeploymentFile>
I want to create a hash with <url> as the key and <appDeploymentFile> as the value for further processing. I've tried all sorts of XPath expressions in twig_handlers for it, but just can't seem to figure it out. Any help appreciated!

Replies are listed 'Best First'.
Re: XML::Twig help please
by mirod (Canon) on Oct 14, 2009 at 16:18 UTC

    When you say <appDeploymentFile> as value, I assume you want the value of the name attribute of the enclosing appDeploymentFile element:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my %url2file;

    XML::Twig->new(
        twig_handlers => {
            url => sub {
                $url2file{ $_->text } = $_->parent('appDeploymentFile')->att('name');
                $_->purge;
            },
        },
    )->parsefile("my_data.xml");

    use YAML::Syck;
    print Dump( \%url2file );

    Is this what you're looking for?

    updated: give the result hash a meaningful name

      Thanks for the help, guys. I was almost there when the requirements changed slightly. The code you guys gave me worked except in the following instance:
      <authenticationReports>
        <generatedTime>Tue Sep 29 07:07:34 PDT 2009</generatedTime>
        <appDeploymentFile name="app-deployment.properties.hklcp.trading">
          <application name="hk">
            <urlInfo><url>a/b/hk/accts_subscription</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
            <urlInfo><url>a/b/hk/accts_forms</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
            <urlInfo><url>a/b/hk/custtradingpage</url><otherPrereq>BasicPrereq</otherPrereq></urlInfo>
            <urlInfo><url>a/b/hk/accts_userinfo</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
            <urlInfo><url>a/b/hk/headermain</url></urlInfo>
            <urlInfo><url>a/b/hk/custservicepage</url></urlInfo>
            <urlInfo><url>a/b/hk/accts_transfermoney</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
            <urlInfo><url>a/b/hk/userprereq</url></urlInfo>
            <urlInfo><url>a/b/hk/indices_us</url></urlInfo>
            <urlInfo><url>a/b/hk/homeloggedmessage</url></urlInfo>
            <urlInfo><url>a/b/hk/lead</url></urlInfo>
            <urlInfo><url>a/b/hk/orderviewmin</url></urlInfo>
            <urlInfo><url>a/b/hk/accts_changelogin</url><otherPrereq>SessionPreReq</otherPrereq></urlInfo>
          </application>
          <application name="intl">
            <urlInfo><url>a/b/intl/quotesandresearch</url></urlInfo>
            <urlInfo><url>a/b/intl/intltablesubnavviewcomponent</url></urlInfo>
            <urlInfo><url>a/b/intl/intltablemetaviewcomponent</url></urlInfo>
            <urlInfo><url>a/b/intl/disclaimer</url></urlInfo>
            <urlInfo><url>a/b/intl/headermain</url></urlInfo>
            <urlInfo><url>a/b/intl/indices_us</url></urlInfo>
            <urlInfo><url>a/b/intl/lead</url></urlInfo>
            <urlInfo><url>a/b/intl/selectlanguage</url></urlInfo>
            <urlInfo><url>a/b/intl/get-screen</url><otherPrereq>BasicPrereq</otherPrereq></urlInfo>
            <urlInfo><url>a/b/intl/page_f</url></urlInfo>
            <urlInfo><url>a/b/intl/basicprereq</url></urlInfo>
            <urlInfo><url>a/b/intl/page</url><otherPrereq>BasicPrereq</otherPrereq></urlInfo>
          </application>
        </appDeploymentFile>
      </authenticationReports>
      The following code reads the URLs under <appDeploymentFile>, but only for the first <application> section. In other words, I get all the URLs for <application name="hk"> but none for <application name="intl">. Here's my code now:
      sub AFXMLtoEM {
          print "Slurping $AFXML....";
          my $TWIG = new XML::Twig ( twig_handlers => {'appDeploymentFile' => \&parseURL} );
          #my $TWIG = new XML::Twig ( twig_handlers => {'appDeploymentFile/application' => \&parseURL} );
          $TWIG -> parsefile ($AFXML) or die "Can't open $AFXML\n" ;
          $TWIG->flush;

          # Now we want to change every value from the XML name to an EM instance identifier
          #print Dumper(\%AFURLS); exit 1;
          while ((my $K, my $ITEM) = each %AFURLS) {
              my ($G1,$G2,$APP,$INST) = split /\./,$ITEM,4;
              unless ($APP eq "") { $ITEM = "prd:" . $APP . ":web:" . $INST; } #Cheesy kludge - fix when Durai confirms
              $AFURLS{$K} = $ITEM;
          }
          print scalar keys %AFURLS, " records slurped in.\n";
      }

      sub parseURL {
          my ($T, $ADEP) = @_;
          my $NAME = $ADEP->att('name');
          for my $URLI ($ADEP->first_child('application')->children('urlInfo')) {
          #for my $URLI ($ADEP->children('urlInfo')) {
              # leading slash added for matching SM filters
              $AFURLS{ "/" . $URLI->first_child('url')->text() } = $NAME;
          }
          #$ADEP->flush;
      }
      How can I 1. get all URLs in any given <appDeploymentFile> section, and 2. append the <application> name value to the end of $NAME (the hash value)? I can then parse it later. I played with various child/next_child etc. methods, and just ain't grokking it yet. Thanks for the input. This is the first time I've tried to use XML::Twig; XML::Simple had always met my needs.
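One possible way to approach both points, offered as a minimal sketch rather than a definitive answer: first_child('application') returns only the first <application>, so the "intl" section is never visited. Looping over children('application') visits every section, and the application's name attribute can be appended to the stored value inside that loop. The "|" separator, the sub name parse_url, and the trimmed inline XML are assumptions made for illustration; they are not from the thread.

```perl
use strict;
use warnings;
use XML::Twig;

my %AFURLS;

# Trimmed sample in the same shape as the posted data (illustrative only).
my $xml = <<'XML';
<authenticationReports>
  <appDeploymentFile name="app-deployment.properties.hklcp.trading">
    <application name="hk">
      <urlInfo><url>a/b/hk/lead</url></urlInfo>
    </application>
    <application name="intl">
      <urlInfo><url>a/b/intl/lead</url><otherPrereq>BasicPrereq</otherPrereq></urlInfo>
    </application>
  </appDeploymentFile>
</authenticationReports>
XML

XML::Twig->new(
    twig_handlers => { appDeploymentFile => \&parse_url },
)->parse($xml);

sub parse_url {
    my ($twig, $adep) = @_;
    my $file = $adep->att('name');

    # children() returns every matching child, not just the first one.
    for my $app ($adep->children('application')) {
        my $app_name = $app->att('name');
        for my $urli ($app->children('urlInfo')) {
            # Leading slash kept from the original code; "|" is a
            # hypothetical separator between file name and application name.
            $AFURLS{ '/' . $urli->first_child_text('url') } = "$file|$app_name";
        }
    }

    # Release the memory held by the part of the document already handled.
    $twig->purge;
}

print "$_ => $AFURLS{$_}\n" for sort keys %AFURLS;
```

Because the value still begins with the file name, a later split on "." keeps working; the appended application name simply ends up at the end of the last field and can be separated from it on the chosen separator.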
Re: XML::Twig help please
by toolic (Bishop) on Oct 14, 2009 at 16:09 UTC
    Undoubtedly, there is a cleaner xpath way to do this, but I think this should help. This assumes that every url is unique (otherwise, you'll be clobbering urls).
    use strict;
    use warnings;
    use XML::Twig;
    use Data::Dumper;

    my $xmlStr = <<XML;
    <?xml version="1.0" encoding="UTF-8"?>
    <authenticationReports>
      <generatedTime>Tue Sep 29 07:07:34 PDT 2009</generatedTime>
      <appDeploymentFile name="app-deployment.properties.hklcp.trading">
        <application name="hk">
          <urlInfo><url>a/b/hk/accts_subscription</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
          <urlInfo><url>a/b/hk/accts_forms</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
          <urlInfo><url>a/b/hk/custtradingpage</url><otherPrereq>BasicPrereq</otherPrereq></urlInfo>
          <urlInfo><url>a/b/hk/accts_userinfo</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
          <urlInfo><url>a/b/hk/headermain</url></urlInfo>
          <urlInfo><url>a/b/hk/custservicepage</url></urlInfo>
          <urlInfo><url>a/b/hk/accts_transfermoney</url><otherPrereq>HKPwdPreReq</otherPrereq></urlInfo>
          <urlInfo><url>a/b/hk/userprereq</url></urlInfo>
          <urlInfo><url>a/b/hk/indices_us</url></urlInfo>
          <urlInfo><url>a/b/hk/homeloggedmessage</url></urlInfo>
          <urlInfo><url>a/b/hk/lead</url></urlInfo>
          <urlInfo><url>a/b/hk/orderviewmin</url></urlInfo>
          <urlInfo><url>a/b/hk/accts_changelogin</url><otherPrereq>SessionPreReq</otherPrereq></urlInfo>
        </application>
      </appDeploymentFile>
    </authenticationReports>
    XML

    my %urls;
    my $twig= new XML::Twig(
        twig_handlers => { 'appDeploymentFile' => \&appdep }
    );
    $twig->parse($xmlStr);
    print Dumper(\%urls);
    exit;

    sub appdep {
        my ($t, $adep) = @_;
        my $name = $adep->att('name');
        for my $urli ($adep->first_child('application')->children('urlInfo')) {
            $urls{ $urli->first_child('url')->text() } = $name;
        }
    }

    __END__

    $VAR1 = {
              'a/b/hk/accts_subscription' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/accts_changelogin' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/lead' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/orderviewmin' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/headermain' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/custservicepage' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/custtradingpage' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/accts_userinfo' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/accts_transfermoney' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/userprereq' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/homeloggedmessage' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/indices_us' => 'app-deployment.properties.hklcp.trading',
              'a/b/hk/accts_forms' => 'app-deployment.properties.hklcp.trading'
            };
      Thanks guys! That worked, but I still am not grokking it fully.
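On the uniqueness caveat above: a twig handler is called once its element has been completely parsed, so each stored url simply overwrites any earlier entry with the same key. If the same url could appear under more than one <appDeploymentFile>, one option is to collect the file names in an array per url instead. The sketch below assumes that situation; the names %files_for_url, file.one and file.two, and the inline XML are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

my %files_for_url;

# Hypothetical data where one url appears under two deployment files.
my $xml = <<'XML';
<authenticationReports>
  <appDeploymentFile name="file.one">
    <application name="hk">
      <urlInfo><url>a/b/shared</url></urlInfo>
    </application>
  </appDeploymentFile>
  <appDeploymentFile name="file.two">
    <application name="intl">
      <urlInfo><url>a/b/shared</url></urlInfo>
    </application>
  </appDeploymentFile>
</authenticationReports>
XML

XML::Twig->new(
    twig_handlers => {
        url => sub {
            my ($twig, $url) = @_;
            # parent() with a condition returns the first matching ancestor.
            my $file = $url->parent('appDeploymentFile')->att('name');
            # Append rather than assign, so repeated urls are not clobbered.
            push @{ $files_for_url{ $url->text } }, $file;
        },
    },
)->parse($xml);

for my $url (sort keys %files_for_url) {
    print "$url => @{ $files_for_url{$url} }\n";
}
```

Whether a list per url is the right shape depends on what the later processing expects; if urls really are unique, the plain hash in the replies above is simpler.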
Re: XML::Twig help please
by Jenda (Abbot) on Oct 15, 2009 at 13:56 UTC

    If you do not insist on using XML::Twig, here is an XML::Rules version that uses a global variable:

    use strict;
    use XML::Rules;

    my %data;
    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules => {
            _default => '',
            url => 'content',
            urlInfo => sub {
                my ($tag,$attr,$context,$parents) = @_;
                $data{ $attr->{url} } = $parents->[-2]{name}
            },
        }
    );

    $parser->parse(\*DATA);

    use Data::Dumper;
    print Dumper(\%data);

    __DATA__
    <authenticationReports>
    <generatedTime>Tue Sep 29 07:07:34 PDT 2009</generatedTime>
    ...
    And here is a version without any globals, so that parse() returns the created hash:
    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        stripspaces => 7,
        rules => {
            _default => 'content',
            urlInfo => sub { return '@url' => $_[1]->{url}; },
            application => sub { return 'url' => $_[1]->{url}; },
            appDeploymentFile => sub {
                return '%urls' => { map {$_ => $_[1]->{name}} @{$_[1]->{url}} }
            },
            authenticationReports => sub { return $_[1]->{urls} }
        }
    );

    my $data = $parser->parse(\*DATA);

    use Data::Dumper;
    print Dumper($data);

    __DATA__
    <authenticationReports>
    <generatedTime>Tue Sep 29 07:07:34 PDT 2009</generatedTime>
    ...

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.
