Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Ugly XML processing looking for a pure XML solution

by mirod (Canon)
on Dec 14, 2000 at 19:18 UTC ( [id://46623]=perlquestion: print w/replies, xml ) Need Help??

mirod has asked for the wisdom of the Perl Monks concerning the following question:

OK, never one too miss an opportunity to contradict myself here is an XML problem that I could not solve using XML techniques, and which forced me to fall back on fancy regexp work. Let's see if anybody can come up with a pure XML solution:

I am converting MIF (FrameMaker Interchange Format) files to XML. so I first do some deep magic to get the data into some kind of very flat XML, basically turning all Frame styles into XML tags. Now I have to add the "superstructure" to the file, all the englobing elements. To do this I used a mechanism similar to what Frame does with their own "Conversion Tables": I describe the content of wrapping tags with a regexp-like syntax. Then off to processing the file...

The initial XML is something like:

<doc><Ch_Chair>Ms Foo</Ch_Chair> <!-- names have been changed to prote +ct the victims identity --> <Ch_Chair>Mr Bar</Ch_Chair> <!-- but there are actually _NO_ comm +ents in the file --> <CommitteeList>Mss Toto</CommitteeList> <CommitteeList>Dr Tata</CommitteeList> <chunk /> <!-- I break the document in chunks so processing is fa +ster --> <!-- other chunks --> <chunk /> </doc>

And I want a result like:

<doc><perslist> <officers> <person>Ms Foo</person> <person>Mr Bar</person> </officers> <person>Mss Toto</person> <person>Dr Tata</person> </perslist> <!-- stuff --> </doc>

The tables describing the transformation are:

my %change=( 'CommitteeList' => 'person', # CommitteeL +ist tag should become person 'Ch_Chair' => 'person[officer]'); # Ch_Chair t +ag should become person too # but consec +utive officers should then be # wrapped in + an officer tag my %wrap= ( officers => 'person[officer]+', # this defin +es the wrapping perslist => 'officers?, person+'); # then we wr +ap the whole list in a perslist tag my @wrap=( 'officers', 'perslist'); # to process + them in the proper order

Seems easy doesn't it? Well here is what I do to process this thing (I wrote it quite a while ago so the style is not great, please bear with it):

my %wrapper; # stores the regular expression generated from % +wrap local $/="<chunk />"; # I have previously inserted those tags so I can + split my file in independant chunks while(my $bit= <$infile>) { # the chunk tag is removed here, amongst other irrelevant things foreach my $tag ( keys %change) { change_tag( $bit, $tag, $change{$tag}); } # looks easy isn't i +t? foreach my $tag (@wrap) { wrap( $bit, $tag, $wrap{$tag}); } # easy too! # remove the extra er attributes (amongst other things) here print $bit; # spit it out } # this one is easy # change source_tag into target_tag, adding an er attribute if specifi +ed sub change_tag($$$) { my $source_tag = $_[1] || warn "no source_tag"; my $target_tag= $_[2] || warn "no target_tag"; my $target_att=''; # check if an attribute should be included (in brackets) if( $target_tag=~ /(\w+)\[(\w+)\]/) { $target_tag=$1; $target_att=$2 }; # replace opening tags, existing attributes are untouched if( $target_att) { $_[0]=~ s{<$source_tag\b} {<$target_tag er="$target_att"}gs; } else { $_[0]=~ s{<$source_tag\b} {<$target_tag}gs; } # replace end tags $_[0]=~ s{</$source_tag>} {</$target_tag>}gs; } # this one is a little scarier (especially the last line!) sub wrap($$$) { my $tag= $_[1]; my $expr=$_[2]; $wrapper{$tag}||= make_wrapper( $expr, $tag); &{$wrapper{$tag}}; # Gee, this looks weird! Did I unknowingly use +the # "call this function with the same @_" trick? +I guess so } # this one is the one that does the real work sub make_wrapper($$) { my( $expr, $tag)= @_; my $att= ''; my $subr; # figure out whether an attribute should be included if( $tag=~ /(\w+)\[(\w+)\]/) { $tag=$1; $att=$2 }; # build regexp from the nicer syntax $expr=~ s{(\w+)\b(?![\[\]])}{(<$1.*?</$1>\\\s*)}g; # no att +ribute given $expr=~ s{(\w+)\[(\w+)\]}{(<$1 er=\"$2\".*?</$1>\\\s*)}g; # attrib +ute given $expr=~ s{,\s*} {\\\s*}g; # now build the wrapper subroutine, replacing the expression by th +e tag if( $att) { $subr= "{ ".'$_[0]'."=~ s{($expr)}{<$tag er=\"$att\">\n".'$1'. +"\n</$tag>}sgo;} "; } else { $subr= "{ ".'$_[0]'."=~ s{($expr)}{<$tag>\n".'$1'."\n</$tag>}s +go;} "; } return eval "sub { $subr }"; }

Ouf! That's all!

So does anyone see a better way to do this, without writing an XML parser using Parse::RecDescent ?

Replies are listed 'Best First'.
Re: Ugly XML processing looking for a pure XML solution
by merlyn (Sage) on Dec 14, 2000 at 19:46 UTC
    If your input document is indeed XML (and hey, I was gonna do the same trick with MIF, so let me know if this indeed works), then you should be able to construct something with XSLT or XML::XPath to do the trick. Or just construct a DOM object, and hack at it. That'd be better than reconstructing all the argument processing stuff directly in Perl, methinks.

    -- Randal L. Schwartz, Perl hacker

      I agree theoretically, my only problem is that practically I did not find a way to do it.

      The problem is applying a regexp (or regexp-like) expression to a series of tag to wrap them. It looks like I have to face one of 2 nasty problems: either I rewrite the XML parsing process or I rewrite the regexp engine! Not being Ilya the choice was easy ;--)

      I don't think that either XPath or XSLT would let me define a _range_ of elements to be wrapped with the same flexibility the regexp engine gives me. I think XPointer is supposed to include ranges, but I don't know how flexible it will be, and I don't know of any Perl module that supports it.

      The problem seems to be just outside of the specs of most existing tools or standards.

Re: Ugly XML processing looking for a pure XML solution
by eg (Friar) on Dec 14, 2000 at 23:16 UTC

    Assuming that your input XML isn't much more complicated than the above fragment (and even if it is, this can still work), a simple XML::Parser will do the trick.

    #!/usr/bin/perl -w use strict; use XML::Parser; my $parser = new XML::Parser( Style => "::localParser" ); print $parser->parsefile( $ARGV[0] ); ######################################################### ######################################################### parser ######################################################### package localParser; use strict; # # called when the parser starts # sub Init { my $self = shift(); $self->{people} = []; $self->{officers} = []; $self->{text} = ''; } sub Start { my ($self, $element, %attr) = @_; $self->{text} = ''; # clear text to be sure } sub Char { my ($self, $string) = @_; $self->{text} .= $string; # append string to text } sub End { my ($self, $element) = @_; # save string in proper # category. if you have # more complicated data # (i.e. a simple array # won't do), you'll have + # to do something more + # clever if ( $element eq 'Ch_Chair' ) { push( @{$self->{officers}}, $self->{text} ); } elsif ( $element eq 'CommitteeList' ) { push( @{$self->{people}}, $self->{text} ); } $self->{text} = ''; } # # the final output # sub Final { my $self = shift(); my $officers = join("\n\t\t\t", map{ "<person>$_</person>" } @{$self->{officers}}); my $people = join("\n\t\t\t", map{ "<person>$_</person>" } @{$self->{people}}); return <<__HERE__ <doc> <perslist> <officers> $officers </officers> $people </perslist> </doc> __HERE__ } 1;

    So essentially what I'm doing is parsing the data into some intermediate data structure and then outputting that as XML. I'm sure that there are other modules on CPAN that'll help you output valid XML based on some more complicated data structure, rather than this simple collection of two arrays :)

    I'm not sure if this qualifies as a "pure xml solution" though!

      Something like this could certainly works, although it would be more complex as the document is actually more complex and and I have about 30 wrapping rules, so I would not be able to wait for the end of the parsing to output the officers and persons. But see how long your solution is? How much job it is for each rule, and you have to write another piece of code for each different rule, or at least each different type of pattern. And my real transformation table has rules such as:

        stdtitle => 'stddes*, stddesmo?, reaf?, stdcoll?, titlemod?, revision?, title+'

      With a solution like yours I would have to simulate (baddly) the regexp engine, while with the code as it stands I just have to add one line to the %wrap table (and an item in the @wraparray) and... voila!. I get a good chunk of regexps for free

      So your solution qualifies as "pure XML", but fails to be a generic one, while mine is not XML (and thus dangerous), but generic, and I am still searching for my Holy Graal of a generic XML solution (which should have been the title of my first post now that I think about it)...

      Your code uses XML::Parser in a very clean way though, witing your own style and storing parser related data (the text, people and officers fields) with the parser. Neat!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://46623]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2024-03-29 08:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found