Re^3: Apache +XML parsing

Thanks Brothers.
Jenda, your XML::Rules module looks interesting, and I'd like to give it a try.
What I need to do is fairly simple and boring: I need to parse a multi-level XML document, contained in the scalar $xmldoc and representing a journal article (*), into a simple hash like

my $href = {
  'TI' => [ 'content of <PubArticle><Article><Title> tag' ],
  'AU' => [ 'content of <PubArticle><Article><Authors><Author name="author1" tag',
            'content of <PubArticle><Article><Authors><Author name="author2" tag',
            etc.. ],
  'REF' => [ and so on... ]
};

The end result I want is thus a hash in which each key maps to an arrayref, the array containing one or more strings picked up from tag attributes and/or tag contents in the original XML document. I admit I am a bit lost after a first read of the online doc. What I don't see very clearly, from the first example at the head of the doc, is how I get the result into my $href hash.
(*) for a full example of the source XML, use this link : http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=18632282


Re^4: Apache +XML parsing
by Jenda (Abbot) on Nov 09, 2008 at 17:55 UTC

You can use XML::Rules->inferRulesFromExample() (or XML::Rules->inferRulesFromDTD() if you have the DTD) to get a basic set of rules for the document, then dump the parsed result with Data::Dumper to see what data structure those rules create. Then you can start tweaking the rules to create a nicer structure. For example, if you do not need the author names split into parts and only want the valid ones, you can delete 'Author' from the list of tags with the 'as array' built-in rule and add a rule like this:

  'Author' => sub {
    return unless $_[1]->{ValidYN} eq 'Y';
    return "$_[1]->{ForeName} $_[1]->{LastName}";
  },
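The inference step mentioned above might look roughly like the sketch below. The sample XML is a hypothetical, heavily trimmed stand-in for the PubMed record, and the sketch assumes that inferRulesFromExample() accepts XML text as well as file names and returns a hash reference of rules, which is how the module documentation reads.

```perl
use strict;
use warnings;
use XML::Rules;
use Data::Dumper;

# Hypothetical, trimmed-down sample; the real PubMed record has many more tags.
my $example = <<'XML';
<PubmedArticle>
  <Article>
    <ArticleTitle>An example title</ArticleTitle>
    <AuthorList>
      <Author ValidYN="Y"><LastName>Smith</LastName><ForeName>Ann</ForeName></Author>
      <Author ValidYN="N"><LastName>Jones</LastName><ForeName>Bob</ForeName></Author>
    </AuthorList>
  </Article>
</PubmedArticle>
XML

# Assumption: the argument may be XML text rather than a file name.
my $rules = XML::Rules->inferRulesFromExample($example);
print Dumper($rules);    # the starting set of rules, to be edited by hand

# Parse with the inferred rules to see the structure they build.
my $parser = XML::Rules->new(rules => $rules, stripspaces => 7);
my $data   = $parser->parse($example);
print Dumper($data);
```

The dumped rules are only a starting point; the point of printing them is to copy them into the script and edit them.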

And then continue tweaking the rules to filter the stuff you are not interested in, format stuff the way you want, rename hash keys etc.
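As a rough illustration of where that tweaking can end up for the hash asked about earlier in the thread, here is a minimal sketch. The tag names (PubArticle, Article, Title, Authors, Author with a name attribute) are the simplified ones from the question, not the real PubMed schema, and the sample document is made up.

```perl
use strict;
use warnings;
use XML::Rules;
use Data::Dumper;

# Made-up sample using the simplified tag names from the question.
my $xmldoc = <<'XML';
<PubArticle>
  <Article>
    <Title>An example title</Title>
    <Authors>
      <Author name="author1"/>
      <Author name="author2"/>
    </Authors>
  </Article>
</PubArticle>
XML

my $parser = XML::Rules->new(
    stripspaces => 7,
    rules => {
        # Returning '@KEY' => value pushes the value onto an array under KEY.
        Title  => sub { return '@TI' => $_[1]->{_content} },
        Author => sub { return '@AU' => $_[1]->{name} },
        # The containers just hand whatever their children built up to the parent.
        Authors    => 'pass no content',
        Article    => 'pass no content',
        PubArticle => 'pass no content',
    },
);

my $href = $parser->parse($xmldoc);
print Dumper($href);
```

Because every container tag uses 'pass no content', the TI and AU arrays bubble up unchanged to the top level, so $href ends up as a flat hash of arrayrefs rather than a nested structure.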

Jenda
Support Denmark!
Defend the free world!


Re^5: Apache +XML parsing

by soliplaya (Beadle) on Nov 18, 2008 at 10:53 UTC

<fruits>
  <average-color>purple</average-color>
  <fruit type="banana">
    <average-color>yellow</average-color>
  </fruit>
  <fruit type="apple">
    <average-color>greenish</average-color>
  </fruit>
</fruits>

  'average-color' => sub {
      if ($_[2]->[-1] eq 'fruit') {
        # parent tag is <fruit>
        return 'fruit-average-color' => $_[1]->{_content};
      } elsif ($_[2]->[-1] eq 'fruits') {
        # parent tag is <fruits>
        return 'global-average-color' => $_[1]->{_content};
      } else {
        # anything else, discard
        return undef;
      }
   },

  'tag' => sub {
     return '@'.$_[0] => $_[1]->{_content};
  },


Re^6: Apache +XML parsing

by Jenda (Abbot) on Nov 18, 2008 at 15:21 UTC

Thanks. I'll see if I can improve the docs. Especially regarding the "x-path-like" stuff. That thing is fairly underdocumented.

The code without using the "helper" would be

use XML::Rules;

my $parser = XML::Rules->new(
    stripspaces => 7,
    rules => {
        fruit => 'as array',
        fruits => 'pass no content',
        'average-color' => [
            'fruit' => sub {return 'fruit-average-color' => $_[1]->{_content}},
            'fruits' => sub {return 'global-average-color' => $_[1]->{_content}},
            sub {},
        ]
    }
);

my $data = $parser->parse(\*DATA);
use Data::Dumper;

print Dumper($data);

__DATA__
<fruits>
  <average-color>purple</average-color>
  <fruit type="banana">
    <average-color>yellow</average-color>
  </fruit>
  <fruit type="apple">
    <average-color>greenish</average-color>
  </fruit>
</fruits>

