mirod has asked for the wisdom of the Perl Monks concerning the following question:
OK, never one too miss an opportunity to contradict myself here is an XML problem that I could not solve using XML techniques, and which forced me to fall back on fancy regexp work. Let's see if anybody can come up with a pure XML solution:
I am converting MIF (FrameMaker Interchange Format) files to XML. so I first do some deep magic to get the data into some kind of very flat XML, basically turning all Frame styles into XML tags. Now I have to add the "superstructure" to the file, all the englobing elements. To do this I used a mechanism similar to what Frame does with their own "Conversion Tables": I describe the content of wrapping tags with a regexp-like syntax. Then off to processing the file...
The initial XML is something like:
|
And I want a result like:
|
The tables describing the transformation are:
my %change=( 'CommitteeList' => 'person', # CommitteeL +ist tag should become person 'Ch_Chair' => 'person[officer]'); # Ch_Chair t +ag should become person too # but consec +utive officers should then be # wrapped in + an officer tag my %wrap= ( officers => 'person[officer]+', # this defin +es the wrapping perslist => 'officers?, person+'); # then we wr +ap the whole list in a perslist tag my @wrap=( 'officers', 'perslist'); # to process + them in the proper order
Seems easy doesn't it? Well here is what I do to process this thing (I wrote it quite a while ago so the style is not great, please bear with it):
my %wrapper; # stores the regular expression generated from % +wrap local $/="<chunk />"; # I have previously inserted those tags so I can + split my file in independant chunks while(my $bit= <$infile>) { # the chunk tag is removed here, amongst other irrelevant things foreach my $tag ( keys %change) { change_tag( $bit, $tag, $change{$tag}); } # looks easy isn't i +t? foreach my $tag (@wrap) { wrap( $bit, $tag, $wrap{$tag}); } # easy too! # remove the extra er attributes (amongst other things) here print $bit; # spit it out } # this one is easy # change source_tag into target_tag, adding an er attribute if specifi +ed sub change_tag($$$) { my $source_tag = $_[1] || warn "no source_tag"; my $target_tag= $_[2] || warn "no target_tag"; my $target_att=''; # check if an attribute should be included (in brackets) if( $target_tag=~ /(\w+)\[(\w+)\]/) { $target_tag=$1; $target_att=$2 }; # replace opening tags, existing attributes are untouched if( $target_att) { $_[0]=~ s{<$source_tag\b} {<$target_tag er="$target_att"}gs; } else { $_[0]=~ s{<$source_tag\b} {<$target_tag}gs; } # replace end tags $_[0]=~ s{</$source_tag>} {</$target_tag>}gs; } # this one is a little scarier (especially the last line!) sub wrap($$$) { my $tag= $_[1]; my $expr=$_[2]; $wrapper{$tag}||= make_wrapper( $expr, $tag); &{$wrapper{$tag}}; # Gee, this looks weird! Did I unknowingly use +the # "call this function with the same @_" trick? +I guess so } # this one is the one that does the real work sub make_wrapper($$) { my( $expr, $tag)= @_; my $att= ''; my $subr; # figure out whether an attribute should be included if( $tag=~ /(\w+)\[(\w+)\]/) { $tag=$1; $att=$2 }; # build regexp from the nicer syntax $expr=~ s{(\w+)\b(?![\[\]])}{(<$1.*?</$1>\\\s*)}g; # no att +ribute given $expr=~ s{(\w+)\[(\w+)\]}{(<$1 er=\"$2\".*?</$1>\\\s*)}g; # attrib +ute given $expr=~ s{,\s*} {\\\s*}g; # now build the wrapper subroutine, replacing the expression by th +e tag if( $att) { $subr= "{ ".'$_[0]'."=~ s{($expr)}{<$tag er=\"$att\">\n".'$1'. +"\n</$tag>}sgo;} "; } else { $subr= "{ ".'$_[0]'."=~ s{($expr)}{<$tag>\n".'$1'."\n</$tag>}s +go;} "; } return eval "sub { $subr }"; }
Ouf! That's all!
So does anyone see a better way to do this, without writing an XML parser using Parse::RecDescent ?
|
---|
Replies are listed 'Best First'. | |
---|---|
Re: Ugly XML processing looking for a pure XML solution
by merlyn (Sage) on Dec 14, 2000 at 19:46 UTC | |
by mirod (Canon) on Dec 14, 2000 at 19:56 UTC | |
Re: Ugly XML processing looking for a pure XML solution
by eg (Friar) on Dec 14, 2000 at 23:16 UTC | |
by mirod (Canon) on Dec 15, 2000 at 13:38 UTC |