OK, never one too miss an opportunity to contradict myself here
is an XML problem that I could not solve using XML techniques,
and which forced me to fall back on fancy regexp work. Let's see
if anybody can come up with a pure XML solution:
I am converting MIF (FrameMaker Interchange Format) files to XML.
so I first do some deep magic to get the data into some kind of
very flat XML, basically turning all Frame styles into XML tags.
Now I have to add the "superstructure" to the file, all the englobing
elements. To do this I used a mechanism similar to what Frame does
with their own "Conversion Tables": I describe the content of wrapping
tags with a regexp-like syntax. Then off to processing the file...
The initial XML is something like:
<doc><Ch_Chair>Ms Foo</Ch_Chair> <!-- names have been changed to prote
+ct the victims identity -->
<Ch_Chair>Mr Bar</Ch_Chair> <!-- but there are actually _NO_ comm
+ents in the file -->
<CommitteeList>Mss Toto</CommitteeList>
<CommitteeList>Dr Tata</CommitteeList>
<chunk /> <!-- I break the document in chunks so processing is fa
+ster -->
<!-- other chunks -->
<chunk />
</doc>
|
And I want a result like:
<doc><perslist>
<officers>
<person>Ms Foo</person>
<person>Mr Bar</person>
</officers>
<person>Mss Toto</person>
<person>Dr Tata</person>
</perslist>
<!-- stuff -->
</doc>
|
The tables describing the transformation are:
my %change=( 'CommitteeList' => 'person', # CommitteeL
+ist tag should become person
'Ch_Chair' => 'person[officer]'); # Ch_Chair t
+ag should become person too
# but consec
+utive officers should then be
# wrapped in
+ an officer tag
my %wrap= ( officers => 'person[officer]+', # this defin
+es the wrapping
perslist => 'officers?, person+'); # then we wr
+ap the whole list in a perslist tag
my @wrap=( 'officers', 'perslist'); # to process
+ them in the proper order
Seems easy doesn't it? Well here is what I do to process this thing
(I wrote it quite a while ago so the style is not great, please bear with it):
my %wrapper; # stores the regular expression generated from %
+wrap
local $/="<chunk />"; # I have previously inserted those tags so I can
+ split my file in independant chunks
while(my $bit= <$infile>)
{ # the chunk tag is removed here, amongst other irrelevant things
foreach my $tag ( keys %change)
{ change_tag( $bit, $tag, $change{$tag}); } # looks easy isn't i
+t?
foreach my $tag (@wrap)
{ wrap( $bit, $tag, $wrap{$tag}); } # easy too!
# remove the extra er attributes (amongst other things) here
print $bit; # spit it out
}
# this one is easy
# change source_tag into target_tag, adding an er attribute if specifi
+ed
sub change_tag($$$)
{ my $source_tag = $_[1] || warn "no source_tag";
my $target_tag= $_[2] || warn "no target_tag";
my $target_att='';
# check if an attribute should be included (in brackets)
if( $target_tag=~ /(\w+)\[(\w+)\]/)
{ $target_tag=$1; $target_att=$2 };
# replace opening tags, existing attributes are untouched
if( $target_att)
{ $_[0]=~ s{<$source_tag\b} {<$target_tag er="$target_att"}gs; }
else
{ $_[0]=~ s{<$source_tag\b} {<$target_tag}gs; }
# replace end tags
$_[0]=~ s{</$source_tag>} {</$target_tag>}gs;
}
# this one is a little scarier (especially the last line!)
sub wrap($$$)
{ my $tag= $_[1];
my $expr=$_[2];
$wrapper{$tag}||= make_wrapper( $expr, $tag);
&{$wrapper{$tag}}; # Gee, this looks weird! Did I unknowingly use
+the
# "call this function with the same @_" trick?
+I guess so
}
# this one is the one that does the real work
sub make_wrapper($$)
{ my( $expr, $tag)= @_;
my $att= '';
my $subr;
# figure out whether an attribute should be included
if( $tag=~ /(\w+)\[(\w+)\]/)
{ $tag=$1; $att=$2 };
# build regexp from the nicer syntax
$expr=~ s{(\w+)\b(?![\[\]])}{(<$1.*?</$1>\\\s*)}g; # no att
+ribute given
$expr=~ s{(\w+)\[(\w+)\]}{(<$1 er=\"$2\".*?</$1>\\\s*)}g; # attrib
+ute given
$expr=~ s{,\s*} {\\\s*}g;
# now build the wrapper subroutine, replacing the expression by th
+e tag
if( $att)
{ $subr= "{ ".'$_[0]'."=~ s{($expr)}{<$tag er=\"$att\">\n".'$1'.
+"\n</$tag>}sgo;} "; }
else
{ $subr= "{ ".'$_[0]'."=~ s{($expr)}{<$tag>\n".'$1'."\n</$tag>}s
+go;} "; }
return eval "sub { $subr }";
}
Ouf! That's all!
So does anyone see a better way to do this, without writing an XML parser
using Parse::RecDescent ?