PerlMonks  

Re^3: Apache +XML parsing

by soliplaya (Beadle)
on Nov 09, 2008 at 14:00 UTC [id://722477]


in reply to Re^2: Apache +XML parsing
in thread Apache +XML parsing

Thanks Brothers.
Jenda, your XML::Rules module looks interesting, and I'd like to give it a try.
What I need to do is fairly simple and boring: I need to parse a multi-level XML document, contained in the scalar $xmldoc, representing a Journal Article (*), into a simple hash like

my $href = {
    'TI' => [
        'content of <PubArticle><Article><Title> tag'
    ],
    'AU' => [
        'content of <PubArticle><Article><Authors><Author name="author1" tag',
        'content of <PubArticle><Article><Authors><Author name="author2" tag',
        # etc.
    ],
    'REF' => [
        # and so on...
    ]
};

The end result I want, then, is a hash in which each key maps to an arrayref, the array containing one or more strings picked up from tag attributes and/or tag contents in the original XML document. I admit I am a bit lost after a first read of the on-line doc. What I don't see clearly, from the first example at the head of the doc, is how I get the result into my $href hash.
(*) For a full example of the source XML, use this link: http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=18632282

Re^4: Apache +XML parsing
by Jenda (Abbot) on Nov 09, 2008 at 17:55 UTC

    You can use XML::Rules->inferRulesFromExample() (or XML::Rules->inferRulesFromDTD() if you have the DTD) to get a basic set of rules for the document, and then see what data structure those rules create (using Data::Dumper). Then you can start tweaking the rules to create a nicer structure. For example, if you do not need the author names split into parts and only want the valid ones, you can delete 'Author' from the list of tags with the 'as array' built-in rule and add a rule like this:

    'Author' => sub {
        return unless $_[1]->{ValidYN} eq 'Y';
        return "$_[1]->{ForeName} $_[1]->{LastName}";
    },

    and see what structure you get.

    And then continue tweaking the rules to filter the stuff you are not interested in, format stuff the way you want, rename hash keys etc.
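    The first two steps might look roughly like the sketch below. The sample XML is made up for illustration (it only loosely imitates the PubMed record), and passing the XML text itself, rather than a file name, to inferRulesFromExample() is an assumption based on my reading of the module docs.

```perl
use strict;
use warnings;
use XML::Rules;
use Data::Dumper;

# Made-up sample, loosely shaped like the PubMed record; not the real document.
my $xmldoc = <<'XML';
<PubArticle>
  <Article>
    <Title>Some title</Title>
    <Authors>
      <Author ValidYN="Y"><ForeName>Ann</ForeName><LastName>Smith</LastName></Author>
      <Author ValidYN="N"><ForeName>Bob</ForeName><LastName>Jones</LastName></Author>
    </Authors>
  </Article>
</PubArticle>
XML

# Step 1: let the module propose a starting rule set from the example.
my $rules = XML::Rules->inferRulesFromExample($xmldoc);
print Dumper($rules);

# Step 2: parse with the inferred rules and inspect the resulting structure.
my $parser = XML::Rules->new(rules => $rules, stripspaces => 7);
my $data   = $parser->parse($xmldoc);
print Dumper($data);
```

    The dumped rules can then be pasted into your own script and edited by hand, for instance by swapping in a custom 'Author' rule.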

      Hi again.
      I have taken XML::Rules for a ride, and it is really nice. It also looks very fast. All in all, a great tool.
      But it did take me a while to get the inner logic of how it works, despite your evident efforts at writing the documentation. It is in the end excellent as a reference, but as far as I am concerned it lacked a bit in terms of being a tutorial, although I understand that the matter is complex and not easy to explain simply.
      What was not evident is how one deals with the frequent case where a given <tag> can be a subtag of different parent tags and needs to be interpreted differently depending on the parent, like <average-color> below:

      <fruits>
        <average-color>purple</average-color>
        <fruit type="banana">
          <average-color>yellow</average-color>
        </fruit>
        <fruit type="apple">
          <average-color>greenish</average-color>
        </fruit>
      </fruits>

      I could never get the "somewhat x-path-like" tags to work, so I resorted to:

      'average-color' => sub {
          if ($_[2]->[-1] eq 'fruit') {
              # parent tag is <fruit>
              return 'fruit-average-color' => $_[1]->{_content};
          }
          elsif ($_[2]->[-1] eq 'fruits') {
              # parent tag is <fruits>
              return 'global-average-color' => $_[1]->{_content};
          }
          else {
              # anything else, discard
              return undef;
          }
      },
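      To see the dispatch on the parent tag in isolation, a rule like this can be called by hand, without the parser. This is only a sketch: it assumes the third argument is the list of enclosing tag names with the immediate parent last (which is what the code above relies on), and the variable names are mine.

```perl
use strict;
use warnings;

# Same logic as the rule above, with named arguments for readability.
my $average_color_rule = sub {
    my ($tag, $attrs, $context) = @_;

    # Assumption: $context lists the enclosing tags, immediate parent last.
    my $parent = @$context ? $context->[-1] : '';

    return 'fruit-average-color'  => $attrs->{_content} if $parent eq 'fruit';
    return 'global-average-color' => $attrs->{_content} if $parent eq 'fruits';
    return;    # anything else: contribute nothing
};

# Simulate the three situations the rule has to handle.
my @inner = $average_color_rule->('average-color', { _content => 'yellow' }, [ 'fruits', 'fruit' ]);
my @outer = $average_color_rule->('average-color', { _content => 'purple' }, [ 'fruits' ]);
my @other = $average_color_rule->('average-color', { _content => 'red' },    [ 'basket' ]);

print "@inner\n";                   # fruit-average-color yellow
print "@outer\n";                   # global-average-color purple
print scalar(@other), " items\n";   # 0 items
```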

      Another aspect that is a bit mysterious is how exactly the following construct works:

      'tag' => sub {
          return '@'.$_[0] => $_[1]->{_content};
      },

      I mean, I know it works and returns the contents of the <tag>s as an array, but how it does so remains a bit mysterious, even after looking at the module code (I must admit that a lot of that code remains mysterious to me).
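      A rough mental model, assuming the behaviour described in the XML::Rules documentation: whatever key/value pairs a rule returns are merged into the hash being built for the parent tag, and a key with a leading '@' means "push the value onto an array stored under the bare key". The helper below, merge_into_parent, is hypothetical and written only to illustrate that idea; it is not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of XML::Rules: merges the key/value list
# returned by a rule into the hash being built for the parent tag.
sub merge_into_parent {
    my ($parent, @pairs) = @_;
    while (my ($key, $value) = splice(@pairs, 0, 2)) {
        if ($key =~ s/^\@//) {
            # '@name' => value : append to the array stored under 'name'
            push @{ $parent->{$key} }, $value;
        }
        else {
            # plain 'name' => value : store, overwriting any earlier value
            $parent->{$key} = $value;
        }
    }
    return $parent;
}

# Two <tag> children and one <title> child reporting to the same parent.
my %parent;
merge_into_parent(\%parent, '@tag'  => 'first');
merge_into_parent(\%parent, '@tag'  => 'second');
merge_into_parent(\%parent, 'title' => 'only one');

print join(', ', @{ $parent{tag} }), "\n";   # first, second
print $parent{title}, "\n";                  # only one
```

      Seen this way, repeated <tag> elements accumulate in one array instead of overwriting each other.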

      All in all, thanks for the tip, and thanks for the module. I will re-use it.

        Thanks. I'll see if I can improve the docs. Especially regarding the "x-path-like" stuff. That thing is fairly underdocumented.

        The code without using the "helper" would be

        use XML::Rules;

        my $parser = XML::Rules->new(
            stripspaces => 7,
            rules => {
                fruit  => 'as array',
                fruits => 'pass no content',
                'average-color' => [
                    'fruit'  => sub { return 'fruit-average-color'  => $_[1]->{_content} },
                    'fruits' => sub { return 'global-average-color' => $_[1]->{_content} },
                    sub {},
                ]
            }
        );

        my $data = $parser->parse(\*DATA);

        use Data::Dumper;
        print Dumper($data);

        __DATA__
        <fruits>
          <average-color>purple</average-color>
          <fruit type="banana">
            <average-color>yellow</average-color>
          </fruit>
          <fruit type="apple">
            <average-color>greenish</average-color>
          </fruit>
        </fruits>
