http://qs321.pair.com?node_id=576872

Good `localtime` monks,
I was trying to find a better way to extract data from XML and I think I might have an idea. Maybe it's stupid, maybe there already is a module with that interface, but maybe not and maybe others will like the idea as well and there will be a point in implementing it.

The basic theme goes like this: XML is basically a serialized data structure, and what we want to get as we parse it is basically a data structure as well. What we get from the parsers is either too generic and too complex (DOM, XML::Parser's Tree style) or too restricted (XML::Simple). Either we end up with a structure that's hard to use, or we have to restrict the set of XMLs we can handle and still end up with a structure that may be more complex than necessary. What might help is a way to specify rules for transforming the individual tags (with their attributes and content) into whatever data structure we need. The rules would be applied from the leaves all the way to the root, either producing a simplified data structure that contains just the stuff we are interested in, in a convenient format, or processing the partial structures as they are produced.

An example is worth a thousand words, so here's one:

    $xml = <<'*END*';
    <doc>
     <person>
      <fname>...</fname>
      <lname>...</lname>
      <email>...</email>
      <address>
       <street>...</street>
       <city>...</city>
       <country>...</country>
       <bogus>...</bogus>
      </address>
     </person>
     <person>
      <fname>...</fname>
      <lname>...</lname>
      <email>...</email>
      <address>
       <street>...</street>
       <city>...</city>
       <country>...</country>
       <bogus>...</bogus>
      </address>
     </person>
    </doc>
    *END*

    %rules = (
        _default => sub { $_[0] => $_[1]->{_content} },
            # by default we are only interested in the content and we want
            # the parent to access it as an attribute of the same name as was the tag
        bogus => undef, # means "ignore"
        address => sub { address => "$_[1]->{street}, $_[1]->{city} ($_[1]->{country})" },
            # let's convert the address to a single string
        person => sub { '@person' => "$_[1]->{lname}, $_[1]->{fname}\n<$_[1]->{email}>\n$_[1]->{address}" },
            # push the stringified data into the @{$parent->{person}}
        doc => sub { join( "\n\n", @{$_[1]->{person}} ) },
    );
    print XML::TransformRules::Parse( $xml, \%rules );
or, a bit more complex:

    $xml = <<'*END*';
    <doc>
     <person>
      <fname>...</fname>
      <lname>...</lname>
      <email>...</email>
      <address>
       <street>...</street>
       <city>...</city>
       <country>...</country>
       <bogus>...</bogus>
      </address>
      <phones>
       <phone type="home">123-456-7890</phone>
       <phone type="office">663-486-7890</phone>
       <phone type="fax">663-486-7000</phone>
      </phones>
     </person>
     <person>
      <fname>...</fname>
      <lname>...</lname>
      <email>...</email>
      <address>
       <street>...</street>
       <city>...</city>
       <country>...</country>
       <bogus>...</bogus>
      </address>
      <phones>
       <phone type="office">663-486-7891</phone>
      </phones>
     </person>
    </doc>
    *END*

    %rules = (
        _default => sub { $_[0] => $_[1]->{_content} },
        bogus => undef,
        address => sub { address => "$_[1]->{street}, $_[1]->{city} ($_[1]->{country})" },
        phone => sub { $_[1]->{type} => $_[1]->{_content} },
            # let's use the "type" attribute as the key and the content as the value
        phones => sub { delete $_[1]->{_content}; %{$_[1]} },
            # remove the text content and pass along the type => content pairs from the child nodes
        person => sub {
            # let's print the values, all the data is readily available in the attributes
            print "$_[1]->{lname}, $_[1]->{fname} <$_[1]->{email}>\n";
            print "Home phone: $_[1]->{home}\n" if $_[1]->{home};
            print "Office phone: $_[1]->{office}\n" if $_[1]->{office};
            print "Fax: $_[1]->{fax}\n" if $_[1]->{fax};
            print "$_[1]->{address}\n\n";
            return; # the <person> tag is processed, no need to remember what it contained
        },
    );

Even though I talked about transforming the data structure, nothing prevents us from applying the rules as we parse the document, as soon as we encounter each closing tag. So we do not have to load the whole document into memory if we don't need it all at once. And even if we do, we have a chance to trim it down as we read it and end up with a much smaller data structure.
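
To make the "apply the rule at the closing tag" idea concrete, here is a minimal sketch in plain Perl. It is not the proposed module: the event list stands in for whatever a SAX/Expat-style parser would deliver, the names `@events` and `run_events` are made up for illustration, the sample names are invented, and the driver handles only rules that return nothing or a key/value list.

```perl
use strict;
use warnings;

# Hypothetical event stream, standing in for a real parser's callbacks:
# [start => tag, {attrs}], [text => string], [end => tag].
my @events = (
    [ start => 'doc',    {} ],
    [ start => 'person', {} ],
    [ start => 'fname',  {} ], [ text => 'Jane' ], [ end => 'fname' ],
    [ start => 'lname',  {} ], [ text => 'Doe' ],  [ end => 'lname' ],
    [ end   => 'person' ],
    [ start => 'person', {} ],
    [ start => 'fname',  {} ], [ text => 'John' ], [ end => 'fname' ],
    [ start => 'lname',  {} ], [ text => 'Roe' ],  [ end => 'lname' ],
    [ end   => 'person' ],
    [ end   => 'doc' ],
);

my @seen;    # what the <person> rule has processed so far
my %rules = (
    _default => sub { $_[0] => $_[1]{_content} },
    # Handle each <person> right away and return nothing, so the parent
    # never accumulates per-person data.
    person   => sub { push @seen, "$_[1]{lname}, $_[1]{fname}"; return },
);

# Simplified driver: supports only "empty list" and "key/value list" results.
sub run_events {
    my ($events, $rules) = @_;
    my @stack = ( {} );    # bottom element plays the role of the root's parent
    for my $ev (@$events) {
        my ($type, $arg, $attrs) = @$ev;
        if ($type eq 'start') {
            push @stack, { %$attrs };         # new hash seeded with the attributes
        }
        elsif ($type eq 'text') {
            $stack[-1]{_content} .= $arg;     # collect the textual content
        }
        else {    # 'end': the tag is complete, so its rule can run now
            my $data = pop @stack;
            my $rule = exists $rules->{$arg} ? $rules->{$arg} : $rules->{_default};
            next unless $rule;                # an undef rule means "ignore this tag"
            my %add = $rule->($arg, $data);
            @{ $stack[-1] }{ keys %add } = values %add;
        }
    }
    return $stack[0];
}

my $root = run_events(\@events, \%rules);
print "$_\n" for @seen;    # prints "Doe, Jane" then "Roe, John"
```

Because the `person` rule returns an empty list, the hash for `<doc>` stays empty no matter how many `<person>` elements go by; only the stack of currently open tags is held in memory.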

The rules receive two parameters: the name of the tag and a hash containing the attributes and the content. For leaf nodes, the hash holds the tag's attributes, and _content is the textual content of the tag. For other tags it's a bit more complex: the hash also contains whatever was returned by the rules of the subtags. The rules may return:

  1. nothing (empty list, undef or empty string) - nothing gets added to the parent's data structure
  2. a single string - the string gets appended or pushed to the _content of the parent
  3. a single reference - the parent's _content is converted to an array (if necessary) and the reference is pushed there
  4. an even-numbered list - the keys (odd items) and values (even items) are added to the parent's data structure; if a key starts with '@', the value is pushed onto the end of the array referenced by that key (without the '@'). The value may be a reference.
  5. everything else is an error
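
The five cases above could be handled by one small merge routine. The following is only a sketch of one possible reading of the list: `merge_into_parent` is a hypothetical name, and the handling of an existing plain-string _content when a reference arrives (it becomes the first array element) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: merge what a rule returned into the parent's hash,
# following the five cases listed above.
sub merge_into_parent {
    my ($parent, @ret) = @_;

    # 1. nothing: empty list, or a single undef / empty string
    return if !@ret
        || (@ret == 1 && (!defined($ret[0]) || $ret[0] eq ''));

    if (@ret == 1) {
        my $c = $parent->{_content};
        if (ref $ret[0]) {
            # 3. single reference: turn _content into an array if needed, then push
            if (ref($c) ne 'ARRAY') {
                $c = (defined($c) && $c ne '') ? [$c] : [];
            }
            push @$c, $ret[0];
            $parent->{_content} = $c;
        }
        elsif (ref($c) eq 'ARRAY') {
            # 2. single string, _content is already an array: push
            push @$c, $ret[0];
        }
        else {
            # 2. single string, plain _content: append
            $parent->{_content} = (defined($c) ? $c : '') . $ret[0];
        }
        return;
    }

    # 5. anything that is not a key/value list is an error
    die "rule returned an odd-sized list\n" if @ret % 2;

    # 4. key/value pairs; a leading '@' on the key means "push"
    while (my ($key, $value) = splice(@ret, 0, 2)) {
        if ($key =~ s/^\@//) { push @{ $parent->{$key} }, $value }
        else                 { $parent->{$key} = $value }
    }
    return;
}

my %parent = ( _content => 'abc' );
merge_into_parent(\%parent, 'def');               # _content is now 'abcdef'
merge_into_parent(\%parent, '@person' => 'x');    # $parent{person} is now ['x']
merge_into_parent(\%parent, home => '123-456-7890');
```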

Hope the explanation makes sense. So the question is, is there something like this already? Does it make sense? Would you be interested in such a module? What parser should I build this on top of? Should it be a separate module or should I rather try to add this to XML::Parser as yet another style?

Re: (RFC) XML::TransformRules
by Belgarion (Chaplain) on Oct 07, 2006 at 18:06 UTC
    You might want to look at XML::Twig for something that looks similar to what you're trying to accomplish.
Re: (RFC) XML::TransformRules
by Jenda (Abbot) on Oct 08, 2006 at 13:47 UTC

    Not much discussion yet :-( Maybe it's the weekend. Maybe I did not explain it well enough. Anyway for now it seems that I WILL implement the module and see if it starts to get used. It'll sit on top of XML::Parser::Expat.

    To allow the rules to both define attributes in the parent's structure and append/push to its content, I'll allow odd-numbered lists: the last item will go into the _content.
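
    A minimal sketch of how such an odd-numbered list might be taken apart, assuming the trailing item is simply peeled off before the key/value pairs are merged. `split_rule_result` is a hypothetical name, not part of any existing module; a one-item list falls out as "content only", which matches the single string and single reference cases.

```perl
use strict;
use warnings;

# Hypothetical helper: split a rule's return list into key/value pairs and
# an optional trailing content item (present when the list has odd length).
sub split_rule_result {
    my @ret = @_;
    my $content = (@ret % 2) ? pop(@ret) : undef;
    return (\@ret, $content);
}

my ($pairs, $content) = split_rule_result(type => 'home', '123-456-7890');
# $pairs is ['type', 'home'], $content is '123-456-7890'
```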

    Here's yet another set of rules to produce something similar to what XML::Simple would create:

        %rules = (
            _default => sub {
                if (scalar(keys %{$_[1]}) == 1) {
                    return $_[0] => $_[1]->{_content}
                } else {
                    return $_[0] => $_[1]
                }
            },
            phone => sub { $_[1]->{type} => $_[1]->{_content} },
            phones => sub { delete $_[1]->{_content}; 'phones' => $_[1] },
        );
Re: (RFC) XML::TransformRules
by dewey (Pilgrim) on Oct 07, 2006 at 17:44 UTC
    I think this is a neat idea, I'll have to go look at XML::Parser and other modules to see how it compares but it definitely piqued my interest.
    ~dewey