http://qs321.pair.com?node_id=642285

logan has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks. I'm trying to parse a chunk of XML and the problem has become significantly more complex than I'm used to. I am requesting an XML page that describes one or more ads. There can be any number of ads returned, and any number of ads of a specific type. I am OK when there is only one ad of a given type, but multiple ads of a given type are problematic. Here is an example of the XML returned from a request for one Preroll, three Midrolls, and one Postroll:
    <AdXML>
      <Preroll>
        <Creative>Preroll_30sec</Creative>
        <CompanionId>N/A</CompanionId>
        <Impression>TBD</Impression>
        <Completion>http://192.168.0.1:80/foo/bar</Completion>
        <TrackingId>null:414</TrackingId>
        <Length>4</Length>
      </Preroll>
      <Postroll>
        <Creative>Postroll_60sec</Creative>
        <CompanionId>N/A</CompanionId>
        <Impression>TBD</Impression>
        <Completion>http://192.168.0.1:80/foo/bar</Completion>
        <TrackingId>null:418</TrackingId>
        <Length>6</Length>
      </Postroll>
      <Midroll>
        <Creative>Midroll_45sec_3</Creative>
        <CompanionId>N/A</CompanionId>
        <Impression>TBD</Impression>
        <Completion>http://192.168.0.1:80/foo/bar</Completion>
        <TrackingId>null:417</TrackingId>
        <Length>5</Length>
      </Midroll>
      <Midroll>
        <Creative>Midroll_45sec_1</Creative>
        <CompanionId>N/A</CompanionId>
        <Impression>TBD</Impression>
        <Completion>http://192.168.0.1:80/foo/bar</Completion>
        <TrackingId>null:415</TrackingId>
        <Length>5</Length>
      </Midroll>
      <Midroll>
        <Creative>Midroll_45sec_2</Creative>
        <CompanionId>N/A</CompanionId>
        <Impression>TBD</Impression>
        <Completion>http://192.168.0.1:80/foo/bar</Completion>
        <TrackingId>null:416</TrackingId>
        <Length>5</Length>
      </Midroll>
    </AdXML>
Using XML::Simple, I can put this all into an object. Run through Data::Dumper, I get this:
    Response Dump:
    $VAR1 = {
        'Preroll' => {
            'Length'      => '4',
            'TrackingId'  => 'null:414',
            'CompanionId' => 'N/A',
            'Creative'    => 'Preroll_30sec',
            'Impression'  => 'TBD',
            'Completion'  => 'http://192.168.0.1:80/foo/bar'
        },
        'Midroll' => [
            {
                'Length'      => '5',
                'TrackingId'  => 'null:415',
                'CompanionId' => 'N/A',
                'Creative'    => 'Midroll_45sec_1',
                'Impression'  => 'TBD',
                'Completion'  => 'http://192.168.0.1:80/foo/bar'
            },
            {
                'Length'      => '5',
                'TrackingId'  => 'null:417',
                'CompanionId' => 'N/A',
                'Creative'    => 'Midroll_45sec_3',
                'Impression'  => 'TBD',
                'Completion'  => 'http://192.168.0.1:80/foo/bar'
            },
            {
                'Length'      => '5',
                'TrackingId'  => 'null:416',
                'CompanionId' => 'N/A',
                'Creative'    => 'Midroll_45sec_2',
                'Impression'  => 'TBD',
                'Completion'  => 'http://192.168.0.1:80/foo/bar'
            }
        ],
        'Postroll' => {
            'Length'      => '6',
            'TrackingId'  => 'null:418',
            'CompanionId' => 'N/A',
            'Creative'    => 'Postroll_60sec',
            'Impression'  => 'TBD',
            'Completion'  => 'http://192.168.0.1:80/foo/bar'
        }
    };
What I need to do is walk through the object so I can compare the values for the end parameters (Length, Creative, etc) for each ad with the expected values. The problems are:
  1. I won't know in advance what order the xml elements will be in. It may be Preroll, Midroll, Postroll, or it may be Midroll, Postroll, Preroll. There is no way of knowing in advance.
  2. If there is only one ad returned for a specific ad type, '*roll' will be a hash reference. If there are multiple ads returned, '*roll' will be a reference to an anonymous array of hashes. It is possible to know in advance which ad type will have multiple ads returned and how many there should be.
What I need is an algorithm that will walk the master hash reference and be smart enough to recognize whether it's encountered a simple hash or an array of hashes and act accordingly. I've tried this:
    my ($self, $response) = @_;
    foreach my $asset_type ( keys %{$response} ) {
        $logger->debug("Starting asset_type $asset_type");
        foreach my $asset_param ( keys %{$response->{$asset_type}} ) {
            $logger->debug("Top of middle FOR loop asset_param = $asset_param: $response->{$asset_type}->{$asset_param}");
            if ( exists ($response->{$asset_type}->{$asset_param}) ) {
                $logger->debug("\t$asset_type asset_param $asset_param exists: ($asset_param) = $response->{$asset_type}->{$asset_param}"); ## LINE 779
            }
            else {
                $logger->debug("\t$asset_type asset_param $asset_param is an array reference");
                my $i = 0;
                while ($response->{$asset_type}[$i]) {
                    foreach my $subkey ( keys %{$response->{$asset_type}[$i]}) {
                        $logger->debug("\t\tTesting $asset_type asset number $i (subkey $subkey) = ($response->{$asset_type}[$i]->{$subkey})");
                    }
                    $i++;
                }
                $logger->debug("Broke innermost WHILE loop asset_param = $asset_param");
            }
            $logger->debug("Bottom of middle FOR loop asset_param = $asset_param");
        }
        $logger->debug("Broke middle FOR asset param loop");
    }
The output is:
    - Starting asset_type Preroll
    - Top of middle FOR loop asset_param = Length: 4
    -     Preroll asset_param Length exists: (Length) = 4
    - Bottom of middle FOR loop asset_param = Length
    - Top of middle FOR loop asset_param = TrackingId: null:414
    -     Preroll asset_param TrackingId exists: (TrackingId) = null:414
    - Bottom of middle FOR loop asset_param = TrackingId
    - Top of middle FOR loop asset_param = CompanionId: N/A
    -     Preroll asset_param CompanionId exists: (CompanionId) = N/A
    - Bottom of middle FOR loop asset_param = CompanionId
    - Top of middle FOR loop asset_param = Creative: KohlFauPreroll_30sec
    -     Preroll asset_param Creative exists: (Creative) = KohlFauPreroll_30sec
    - Bottom of middle FOR loop asset_param = Creative
    - Top of middle FOR loop asset_param = Impression: TBD
    -     Preroll asset_param Impression exists: (Impression) = TBD
    - Bottom of middle FOR loop asset_param = Impression
    - Top of middle FOR loop asset_param = Completion: http://172.24.16.84:8380/baapi/hics
    -     Preroll asset_param Completion exists: (Completion) = http://172.24.16.84:8380/baapi/hics
    - Bottom of middle FOR loop asset_param = Completion
    - Broke middle FOR asset param loop
    - Starting asset_type Midroll
    - Top of middle FOR loop asset_param = Length:
    -     Midroll asset_param Length is an array reference
    -         Testing Midroll asset number 0 (subkey Length) = (5)
    -         Testing Midroll asset number 0 (subkey TrackingId) = (null:417)
    -         Testing Midroll asset number 0 (subkey CompanionId) = (N/A)
    -         Testing Midroll asset number 0 (subkey Creative) = (KohlFauMidroll_45sec_3)
    -         Testing Midroll asset number 0 (subkey Impression) = (TBD)
    -         Testing Midroll asset number 0 (subkey Completion) = (http://172.24.16.84:8380/baapi/hics)
    -         Testing Midroll asset number 1 (subkey Length) = (5)
    -         Testing Midroll asset number 1 (subkey TrackingId) = (null:415)
    -         Testing Midroll asset number 1 (subkey CompanionId) = (N/A)
    -         Testing Midroll asset number 1 (subkey Creative) = (KohlFauMidroll_45sec_1)
    -         Testing Midroll asset number 1 (subkey Impression) = (TBD)
    -         Testing Midroll asset number 1 (subkey Completion) = (http://172.24.16.84:8380/baapi/hics)
    -         Testing Midroll asset number 2 (subkey Length) = (5)
    -         Testing Midroll asset number 2 (subkey TrackingId) = (null:416)
    -         Testing Midroll asset number 2 (subkey CompanionId) = (N/A)
    -         Testing Midroll asset number 2 (subkey Creative) = (KohlFauMidroll_45sec_2)
    -         Testing Midroll asset number 2 (subkey Impression) = (TBD)
    -         Testing Midroll asset number 2 (subkey Completion) = (http://172.24.16.84:8380/baapi/hics)
    - Broke innermost WHILE loop asset_param = Length
    - Bottom of middle FOR loop asset_param = Length
At that point the program dies with the error "Bad index while coercing array into hash at OO_HttpInterfaceTest.pm line 779." Line 779 in this case is the statement $logger->debug("\t$asset_type asset_param $asset_param exists: ($asset_param) = $response->{$asset_type}->{$asset_param}"); If I remove the logging statement, the code chokes on line 780 with the same error, which leads me to suspect that the actual error is with the expression "$response->{$asset_type}->{$asset_param}".

The break is happening after the Midroll section is evaluated. The value for 'Completion' is displayed at which point the loop should exit. What seems to be happening is the code tests if $response->{$asset_type}->{$asset_param} exists, finds that it doesn't, and exits rather than going to the else condition. I have no idea why it only does this when transitioning from walking the anonymous array back to a normal hash.

I've been on this for most of the day. Please help! And if there's some vastly easier/less complex way to do this, I'm all ears.
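One possible way to handle both shapes in a single loop is to check ref() before dereferencing, so that a lone hash and an array of hashes are walked the same way. The sketch below uses only core Perl and a trimmed-down, hard-coded copy of the dump above; the helper name ads_for is made up for illustration. The "Bad index while coercing array into hash" message looks consistent with the pseudo-hash feature of older Perls, which try to treat an array reference as a hash when it is dereferenced with {...}; that reading is an assumption based on the message text, not something confirmed against this installation.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of ad hashes whether the parser
# produced a single hash ref or an array ref of hash refs.
sub ads_for {
    my ($node) = @_;
    return ()       unless defined $node;
    return @{$node} if ref($node) eq 'ARRAY';
    return ($node)  if ref($node) eq 'HASH';
    die "Unexpected node type: " . ( ref($node) || 'plain scalar' ) . "\n";
}

# Trimmed-down copy of the structure shown in the dump above.
my $response = {
    Preroll => { Creative => 'Preroll_30sec', Length => '4' },
    Midroll => [
        { Creative => 'Midroll_45sec_1', Length => '5' },
        { Creative => 'Midroll_45sec_3', Length => '5' },
        { Creative => 'Midroll_45sec_2', Length => '5' },
    ],
    Postroll => { Creative => 'Postroll_60sec', Length => '6' },
};

# Walk every ad of every type; sorting makes the output order predictable.
my @lines;
for my $asset_type ( sort keys %{$response} ) {
    my $i = 0;
    for my $ad ( ads_for( $response->{$asset_type} ) ) {
        for my $param ( sort keys %{$ad} ) {
            push @lines, "$asset_type $i: $param = $ad->{$param}";
        }
        $i++;
    }
}
print "$_\n" for @lines;
```

Because the ref() check happens before any dereference, the loop never asks an array reference for a hash key, which is the step that appears to fail in the code above.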

Thanks,

-Logan
"What do I want? I'm an American. I want more."

Replies are listed 'Best First'.
Re: XML::Simple Meets Complex Hash Structure
by GrandFather (Saint) on Oct 03, 2007 at 02:13 UTC

    Switching to a less simple, but simplifying module such as XML::Twig may help. Consider:

        use strict;
        use warnings;
        use XML::Twig;

        my $twig = XML::Twig->new (
            twig_handlers => {
                Preroll  => \&handleNode,
                Midroll  => \&handleNode,
                Postroll => \&handleNode,
            }
        );
        $twig->parse (do {local $/; <DATA>});

        sub handleNode {
            my ($tag, $elt) = @_;
            my $creative = $elt->first_child ('Creative');
            print "Creative: ", $creative->text (), "\n";
        }

        __DATA__

    Prints:

        Creative: Preroll_30sec
        Creative: Postroll_60sec
        Creative: Midroll_45sec_3
        Creative: Midroll_45sec_1
        Creative: Midroll_45sec_2

    Perl is environmentally friendly - it saves trees
Re: XML::Simple Meets Complex Hash Structure
by throop (Chaplain) on Oct 03, 2007 at 03:35 UTC
    Gentle Logan

    About XML::Simple—In XMLin

    • Use the RootName option to set the root to AdXML. I'll assume you set the ref to the root to $AdXML.
    • Use the ForceArray option on Preroll, Midroll and Postroll.
      • You probably want
        use XML::Simple;
        my $xs = XML::Simple->new;
        my $AdXML = $xs->XMLin($adfile, ForceArray=>1, RootName=>'AdXML');
      That way, no matter whether there's one or some other number of instances, $AdXML->{Midroll} etc. will always be an array ref (not a hash ref). (not tested.)
    • With this approach, the number of ads, the arbitrary order of the ads, and the arbitrary order of their attributes is not an issue.
    • If you didn't know about ForceArray, you probably don't know about KeyAttr either. Read the docs on it; as they say, it's important and you'll want to understand it.
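As a hedged variation on the ForceArray suggestion above: the XML::Simple documentation also allows ForceArray to be given a list of element names, so only the ad containers become arrays and the leaf values stay plain strings. The sketch below assumes XML::Simple is installed and uses a shortened inline copy of the sample XML. RootName is left out because the documentation lists it among the XMLout options rather than the XMLin ones, and KeyAttr is set to an empty list to switch off array folding.

```perl
use strict;
use warnings;
use XML::Simple;

# Shortened copy of the sample response: one Preroll, two Midrolls.
my $xml = <<'XML';
<AdXML>
  <Preroll>
    <Creative>Preroll_30sec</Creative>
    <Length>4</Length>
  </Preroll>
  <Midroll>
    <Creative>Midroll_45sec_3</Creative>
    <Length>5</Length>
  </Midroll>
  <Midroll>
    <Creative>Midroll_45sec_1</Creative>
    <Length>5</Length>
  </Midroll>
</AdXML>
XML

my $xs    = XML::Simple->new;
my $AdXML = $xs->XMLin(
    $xml,
    ForceArray => [qw(Preroll Midroll Postroll)],   # only these become arrays
    KeyAttr    => [],                               # no folding on name/key/id
);

# Every ad type is now an array ref, however many ads came back.
for my $type ( sort keys %{$AdXML} ) {
    for my $ad ( @{ $AdXML->{$type} } ) {
        print "$type: $ad->{Creative} (Length $ad->{Length})\n";
    }
}
```

With this form, $AdXML->{Preroll} is a one-element array and $AdXML->{Midroll}[0]{Creative} is a plain string, so no trailing ->[0] is needed on the leaf values.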
    throop
Re: XML::Simple Meets Complex Hash Structure
by djp (Hermit) on Oct 03, 2007 at 05:34 UTC
    ++ to the ForceArray suggestion. You might also try XML-Smart which solves this problem very elegantly.
Solution to: XML::Simple Meets Complex Hash Structure
by logan (Curate) on Oct 04, 2007 at 00:45 UTC
    Thanks to all who weighed in with help. This was my first time working with XML, and the mass of available Perl libraries is both encouraging and daunting. In the end I used XML::Simple and defined forcearray:
        my $xml_response_object = $xs->XMLin($response->content, forcearray => 1);
    I kept getting errors when I tried to define rootname, so I just took the default. It's not the cleanest code I've ever written, but it works.

    For the Monks reading this in years to come, here's how it all turned out:

    The XML response was parsed into an XML::Simple object:

        my $xs = XML::Simple->new;
        my $xml_response_object = $xs->XMLin($response->content, forcearray => 1);

    Once parsed, it looked like this:

        print Dumper($response);

        Response Dump:
        $VAR1 = {
            'Preroll' => [
                {
                    'Length'      => [ '4' ],
                    'TrackingId'  => [ 'null:414' ],
                    'CompanionId' => [ 'N/A' ],
                    'Creative'    => [ 'Preroll_30sec' ],
                    'Impression'  => [ 'TBD' ],
                    'Completion'  => [ 'http://192.168.0.1:80/foo/bar' ]
                }
            ],
            'Midroll' => [
                {
                    'Length'      => [ '5' ],
                    'TrackingId'  => [ 'null:416' ],
                    'CompanionId' => [ 'N/A' ],
                    'Creative'    => [ 'Midroll_45sec_2' ],
                    'Impression'  => [ 'TBD' ],
                    'Completion'  => [ 'http://192.168.0.1:80/foo/bar' ]
                },
                {
                    'Length'      => [ '5' ],
                    'TrackingId'  => [ 'null:415' ],
                    'CompanionId' => [ 'N/A' ],
                    'Creative'    => [ 'Midroll_45sec_1' ],
                    'Impression'  => [ 'TBD' ],
                    'Completion'  => [ 'http://192.168.0.1:80/foo/bar' ]
                },
                {
                    'Length'      => [ '5' ],
                    'TrackingId'  => [ 'null:417' ],
                    'CompanionId' => [ 'N/A' ],
                    'Creative'    => [ 'Midroll_45sec_3' ],
                    'Impression'  => [ 'TBD' ],
                    'Completion'  => [ 'http://192.168.0.1:80/foo/bar' ]
                }
            ],
            'Postroll' => [
                {
                    'Length'      => [ '6' ],
                    'TrackingId'  => [ 'null:418' ],
                    'CompanionId' => [ 'N/A' ],
                    'Creative'    => [ 'Postroll_60sec' ],
                    'Impression'  => [ 'TBD' ],
                    'Completion'  => [ 'http://192.168.0.1:80/foo/bar' ]
                }
            ]
        };

    I was able to walk the entire object with this code:

        foreach my $asset_type ( keys %{$response} ) {
            $logger->debug("Starting asset_type $asset_type");
            my $i = 0;
            while ($response->{$asset_type}->[$i]) {
                $logger->debug("\t$asset_type $i:");
                foreach my $param ( keys %{($response->{$asset_type}->[$i])}) { # each $i is a hash ref
                    $logger->debug("\t\t$param = $response->{$asset_type}->[$i]->{$param}->[0]");
                }
                $i++;
            }
        }
    And I was able to access the lowest-level data directly using:

    $response->{$asset_type}->[$i]->{Creative}->[0]
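For the comparison against expected values mentioned in the original question, a minimal sketch might first collapse the one-element arrays and then check each ad against a table of expectations. Everything named here (flatten_ad, compare_ad, %expected and its contents) is hypothetical and only illustrates the access pattern above; the sample data is a cut-down copy of the dump, and one expectation is deliberately wrong so that a mismatch is reported.

```perl
use strict;
use warnings;

# Collapse the single-element arrays that forcearray => 1 puts around
# every leaf value, giving one flat hash per ad.
sub flatten_ad {
    my ($ad) = @_;
    my %flat = map { $_ => $ad->{$_}[0] } keys %{$ad};
    return \%flat;
}

# Return a list of mismatch descriptions (empty when everything matches).
sub compare_ad {
    my ( $got, $expected ) = @_;
    my @problems;
    for my $param ( sort keys %{$expected} ) {
        my $have = defined $got->{$param} ? $got->{$param} : '(missing)';
        push @problems, "$param: expected '$expected->{$param}', got '$have'"
            unless $have eq $expected->{$param};
    }
    return @problems;
}

# Cut-down copy of the parsed structure shown above.
my $response = {
    Midroll => [
        { Creative => ['Midroll_45sec_2'], Length => ['5'] },
        { Creative => ['Midroll_45sec_1'], Length => ['5'] },
    ],
};

# Hypothetical expectations, keyed by Creative so that ad order does not matter.
my %expected = (
    Midroll_45sec_1 => { Length => '5' },
    Midroll_45sec_2 => { Length => '6' },    # deliberately wrong
);

my @report;
for my $raw ( @{ $response->{Midroll} } ) {
    my $ad   = flatten_ad($raw);
    my $want = $expected{ $ad->{Creative} } or next;
    push @report, map { "$ad->{Creative}: $_" } compare_ad( $ad, $want );
}
print "$_\n" for @report;
```

Keying the expectations by Creative is one way to sidestep the unpredictable order of the ads; any other field that identifies an ad uniquely, such as TrackingId, would serve the same purpose.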

    The logging is handled by Log::Log4perl, which is the logging package I've been looking for for years. The HTTP request is handled by LWP::UserAgent, HTTP::Request, HTTP::Response, and URI::Heuristic, and debugging was vastly aided by Data::Dumper.

    -Logan
    "What do I want? I'm an American. I want more."

Re: XML::Simple Meets Complex Hash Structure
by pajout (Curate) on Oct 03, 2007 at 16:56 UTC
    You can use XML::Trivial too. If you try it, please let me know whether its "trivial" interface satisfies you. I am just curious whether it can satisfy somebody other than me :>))