Ugly XML processing looking for a pure XML solution

mirod has asked for the wisdom of the Perl Monks concerning the following question:

OK, never one too miss an opportunity to contradict myself here is an XML problem that I could not solve using XML techniques, and which forced me to fall back on fancy regexp work. Let's see if anybody can come up with a pure XML solution:

I am converting MIF (FrameMaker Interchange Format) files to XML. so I first do some deep magic to get the data into some kind of very flat XML, basically turning all Frame styles into XML tags. Now I have to add the "superstructure" to the file, all the englobing elements. To do this I used a mechanism similar to what Frame does with their own "Conversion Tables": I describe the content of wrapping tags with a regexp-like syntax. Then off to processing the file...

The initial XML is something like:

<doc><Ch_Chair>Ms Foo</Ch_Chair> <!-- names have been changed to prote
+ct the victims identity -->
     <Ch_Chair>Mr Bar</Ch_Chair> <!-- but there are actually _NO_ comm
+ents in the file -->
     <CommitteeList>Mss Toto</CommitteeList>
     <CommitteeList>Dr Tata</CommitteeList>
     <chunk /> <!-- I break the document in chunks so processing is fa
+ster -->
     <!-- other chunks -->
     <chunk />
</doc>
[download]

And I want a result like:

<doc><perslist>
       <officers>
         <person>Ms Foo</person>
         <person>Mr Bar</person>
       </officers>
       <person>Mss Toto</person>
       <person>Dr Tata</person>
     </perslist>
     <!-- stuff -->
</doc>
[download]

The tables describing the transformation are:

my %change=( 'CommitteeList'      => 'person',            # CommitteeL
+ist tag should become person
             'Ch_Chair'           => 'person[officer]');  # Ch_Chair t
+ag should become person too 
                                                          # but consec
+utive officers should then be 
                                                          # wrapped in
+ an officer tag

my %wrap= ( officers         => 'person[officer]+',       # this defin
+es the wrapping
            perslist         => 'officers?, person+');    # then we wr
+ap the whole list in a perslist tag

my @wrap=( 'officers', 'perslist');                       # to process
+ them in the proper order
[download]

Seems easy doesn't it? Well here is what I do to process this thing (I wrote it quite a while ago so the style is not great, please bear with it):

my %wrapper;          # stores the regular expression generated from %
+wrap 

local $/="<chunk />"; # I have previously inserted those tags so I can
+ split my file in independant chunks

while(my $bit= <$infile>)
  { # the chunk tag is removed here, amongst other irrelevant things

    foreach my $tag ( keys %change)
      { change_tag( $bit, $tag, $change{$tag}); } # looks easy isn't i
+t?

    foreach my $tag (@wrap)
      { wrap( $bit, $tag, $wrap{$tag}); }         # easy too!

    # remove the extra er attributes (amongst other things) here

    print $bit;                                   # spit it out
  }
 

# this one is easy

# change source_tag into target_tag, adding an er attribute if specifi
+ed
sub change_tag($$$)
  { my $source_tag = $_[1] || warn "no source_tag";
    my $target_tag= $_[2] || warn "no target_tag";
    my $target_att='';

    # check if an attribute should be included (in brackets)
    if( $target_tag=~ /(\w+)\[(\w+)\]/)
      { $target_tag=$1; $target_att=$2 };

    # replace opening tags, existing attributes are untouched
    if( $target_att)
      { $_[0]=~ s{<$source_tag\b} {<$target_tag er="$target_att"}gs; }
    else
      { $_[0]=~ s{<$source_tag\b} {<$target_tag}gs;  }
    
    # replace end tags
    $_[0]=~ s{</$source_tag>} {</$target_tag>}gs; 

  }


# this one is a little scarier (especially the last line!)
sub wrap($$$)
  { my $tag= $_[1];
    my $expr=$_[2];
    $wrapper{$tag}||= make_wrapper( $expr, $tag);
    &{$wrapper{$tag}}; # Gee, this looks weird! Did I unknowingly use 
+the
                       # "call this function with the same @_" trick? 
+I guess so
  }

# this one is the one that does the real work
sub make_wrapper($$)
  { my( $expr, $tag)= @_;
    my $att= '';
    my $subr;

    # figure out whether an attribute should be included
    if( $tag=~ /(\w+)\[(\w+)\]/)
      { $tag=$1; $att=$2 };

    # build regexp from the nicer syntax
    $expr=~ s{(\w+)\b(?![\[\]])}{(<$1.*?</$1>\\\s*)}g;        # no att
+ribute given
    $expr=~ s{(\w+)\[(\w+)\]}{(<$1 er=\"$2\".*?</$1>\\\s*)}g; # attrib
+ute given
    $expr=~ s{,\s*} {\\\s*}g;

    # now build the wrapper subroutine, replacing the expression by th
+e tag
    if( $att)
      { $subr= "{ ".'$_[0]'."=~ s{($expr)}{<$tag er=\"$att\">\n".'$1'.
+"\n</$tag>}sgo;} "; }
    else
      { $subr= "{ ".'$_[0]'."=~ s{($expr)}{<$tag>\n".'$1'."\n</$tag>}s
+go;} "; }

    return eval "sub { $subr }";
  }
[download]

Ouf! That's all!

So does anyone see a better way to do this, without writing an XML parser using Parse::RecDescent ?

Comment on Ugly XML processing looking for a pure XML solution Select or Download Code

Replies are listed 'Best First'.
Re: Ugly XML processing looking for a pure XML solution by merlyn (Sage) on Dec 14, 2000 at 19:46 UTC
If your input document is indeed XML (and hey, I was gonna do the same trick with MIF, so let me know if this indeed works), then you should be able to construct something with XSLT or XML::XPath to do the trick. Or just construct a DOM object, and hack at it. That'd be better than reconstructing all the argument processing stuff directly in Perl, methinks. -- Randal L. Schwartz, Perl hacker	[reply]
Re: Re: Ugly XML processing looking for a pure XML solution by mirod (Canon) on Dec 14, 2000 at 19:56 UTC
I agree theoretically, my only problem is that practically I did not find a way to do it. The problem is applying a regexp (or regexp-like) expression to a series of tag to wrap them. It looks like I have to face one of 2 nasty problems: either I rewrite the XML parsing process or I rewrite the regexp engine! Not being Ilya the choice was easy ;--) I don't think that either XPath or XSLT would let me define a _range_ of elements to be wrapped with the same flexibility the regexp engine gives me. I think XPointer is supposed to include ranges, but I don't know how flexible it will be, and I don't know of any Perl module that supports it. The problem seems to be just outside of the specs of most existing tools or standards.	[reply]
Re: Ugly XML processing looking for a pure XML solution by eg (Friar) on Dec 14, 2000 at 23:16 UTC
Assuming that your input XML isn't much more complicated than the above fragment (and even if it is, this can still work), a simple XML::Parser will do the trick. #!/usr/bin/perl -w use strict; use XML::Parser; my $parser = new XML::Parser( Style => "::localParser" ); print $parser->parsefile( $ARGV[0] ); ######################################################### ######################################################### parser ######################################################### package localParser; use strict; # # called when the parser starts # sub Init { my $self = shift(); $self->{people} = []; $self->{officers} = []; $self->{text} = ''; } sub Start { my ($self, $element, %attr) = @_; $self->{text} = ''; # clear text to be sure } sub Char { my ($self, $string) = @_; $self->{text} .= $string; # append string to text } sub End { my ($self, $element) = @_; # save string in proper # category. if you have # more complicated data # (i.e. a simple array # won't do), you'll have + # to do something more + # clever if ( $element eq 'Ch_Chair' ) { push( @{$self->{officers}}, $self->{text} ); } elsif ( $element eq 'CommitteeList' ) { push( @{$self->{people}}, $self->{text} ); } $self->{text} = ''; } # # the final output # sub Final { my $self = shift(); my $officers = join("\n\t\t\t", map{ "<person>$_</person>" } @{$self->{officers}}); my $people = join("\n\t\t\t", map{ "<person>$_</person>" } @{$self->{people}}); return <<__HERE__ <doc> <perslist> <officers> $officers </officers> $people </perslist> </doc> __HERE__ } 1; [download] So essentially what I'm doing is parsing the data into some intermediate data structure and then outputting that as XML. I'm sure that there are other modules on CPAN that'll help you output valid XML based on some more complicated data structure, rather than this simple collection of two arrays :) I'm not sure if this qualifies as a "pure xml solution" though!	[reply] [d/l]
Re: Re: Ugly XML processing looking for a pure XML solution by mirod (Canon) on Dec 15, 2000 at 13:38 UTC
Something like this could certainly works, although it would be more complex as the document is actually more complex and and I have about 30 wrapping rules, so I would not be able to wait for the end of the parsing to output the officers and persons. But see how long your solution is? How much job it is for each rule, and you have to write another piece of code for each different rule, or at least each different type of pattern. And my real transformation table has rules such as: `stdtitle => 'stddes*, stddesmo?, reaf?, stdcoll?, titlemod?, revision?, title+'` With a solution like yours I would have to simulate (baddly) the regexp engine, while with the code as it stands I just have to add one line to the `%wrap` table (and an item in the `@wrap`array) and... voila!. I get a good chunk of regexps for free So your solution qualifies as "pure XML", but fails to be a generic one, while mine is not XML (and thus dangerous), but generic, and I am still searching for my Holy Graal of a generic XML solution (which should have been the title of my first post now that I think about it)... Your code uses XML::Parser in a very clean way though, witing your own style and storing parser related data (the `text, people and officers` fields) with the parser. Neat!	[reply] [d/l]