peace has asked for the wisdom of the Perl Monks concerning the following question:

Most Esteemed and Holiest Monks,

I supplicate you for perls of wisdom...

I've got a large XML document with 270 <graphic> elements in it, with <title> elements that precede the
<graphic> elements, as in the following snippet from the file:


<figure airframe="PLATFORM-1" id="aircraftupdimensions" span="1" tocentry="1">
<title>Aircraft Up Dimensions</title>
<graphic fileref="MH60S_Figures\Chapter_1\aircraftupdimensions.jpg" />
</figure>
<figure airframe="PLATFORM-1" id="aircraftupclearanceandturningradius" span="1">
<title>Aircraft Up Clearance and Turning Radius</title>
<graphic fileref="MH60S_Figures\Chapter_1\aircraftupclearanceandturningradius.jpg" />
</figure>
<para airframe="PLATFORM-1"></para>
<figure airframe="PLATFORM-1" id="aircraftdownclearanceandturningradius" span="1">
<title>Aircraft Down Clearance and Turning Radius</title>
<graphic fileref="MH60S_Figures\Chapter_1\aircraftdownclearanceandturningradius.jpg" />
</figure>
<figure airframe="PLATFORM-1" id="aircraftdowndimensions" span="1" tocentry="1">
<title>Aircraft Down Dimensions</title>
<graphic fileref="MH60S_Figures\Chapter_1\aircraftdowndimensions.jpg" />
</figure>


I was advised that it would be pretty easy to use perl to transform my document so that the
<title>...</title> text would follow the <graphic> element in all 270 <figure>...</figure> contexts.

In fact my advisor gave me the following command line statement to do the transform. But it doesn't
do the job.

perl -p -i.bak -e "s/(<figure.*>)(<title.*>.*<\/title>)(<graphic.*\/>)(<\/figure>)/$1$3$2$4/;" MyDoc.xml

Any ideas what might be going wrong? And what would work?


Peace,

--Jack

Here's what the transformed snippet should look like---with or without the line breaks;
it could be just one long line of text with no newlines (\n):

<figure airframe="PLATFORM-1" id="aircraftupdimensions" span="1" tocentry="1">
<graphic fileref="MH60S_Figures\Chapter_1\aircraftupdimensions.jpg" />
<title>Aircraft Up Dimensions</title>
</figure>
<figure airframe="PLATFORM-1" id="aircraftupclearanceandturningradius" span="1" >
<graphic fileref="MH60S_Figures\Chapter_1\aircraftupclearanceandturningradius.jpg" />
<title>Aircraft Up Clearance and Turning Radius</title>
</figure>
<para airframe="PLATFORM-1"></para>
<figure airframe="PLATFORM-1" id="aircraftdownclearanceandturningradius" span="1" >
<graphic fileref="MH60S_Figures\Chapter_1\aircraftdownclearanceandturningradius.jpg"/>
<title>Aircraft Down Clearance and Turning Radius</title>
</figure>
<figure airframe="PLATFORM-1" id="aircraftdowndimensions" span="1" tocentry="1">
<graphic fileref="MH60S_Figures\Chapter_1\aircraftdowndimensions.jpg"/>
<title>Aircraft Down Dimensions</title>
</figure>

Replies are listed 'Best First'.
Re: Swapping XML Elements
by GrandFather (Saint) on Nov 29, 2005 at 21:27 UTC

    To emphasise and make a little clearer what previous replies have said: It is almost never appropriate to use a regex to manipulate XML. The two biggest problems are whitespace and nested elements. It is very hard, and in many cases not possible in any sensible way, to write regexen to perform edits on XML.

    As mentioned by marto, XML::Twig is the way to do this sort of thing:

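    use strict;
    use warnings;
    use XML::Twig;

    my $t = XML::Twig->new (
        twig_roots               => {'figure' => \&doSwap},
        twig_print_outside_roots => 1,
    );

    # Slurp the whole sample document from the DATA section (not shown here).
    # An undefined $/ reads everything; $/ = '' would stop at the first blank line.
    my $source = do {local $/; <DATA>};
    $t->set_pretty_print ('record');
    $t->parse ($source);

    sub doSwap {
        my ($t, $figure) = @_;
        my @title = $figure->cut_children ('title');

        # Move the <title> to the end; skip figures that have no title
        $title[0]->paste ('last_child', $figure) if @title;
        $figure->print;
    }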

    DWIM is Perl's answer to Gödel
Re: Swapping XML Elements
by marto (Cardinal) on Nov 29, 2005 at 19:38 UTC
    peace,

    If you are able to use a module to achieve this I would suggest XML::Twig.
    The examples on www.xmltwig.com, along with the documentation, are helpful.
    In fact, the first example in the Tutorials section is titled "Reordering an XML file".

    Hope this helps.

    Martin
Re: Swapping XML Elements
by peace (Novice) on Nov 30, 2005 at 22:05 UTC
    Thanks. You guys are terrific! I wasn't keen on using regular expressions,
    so the XML::Twig module was right on target.

    Here's the code I ended up using to fix up my 3MB XML document.

    use strict;
    use warnings;
    use XML::Twig;

    my $infile  = "in-test.xml";
    my $outfile = "out-test.xml";

    open(OUT, ">$outfile") or die "can't open output file $outfile:$!";
    open(IN, "$infile") or die "can't open input file $infile:$!";

    my $t = XML::Twig->new (
        twig_roots               => {'figure' => \&doSwap},
        twig_print_outside_roots => \*OUT,
    );

    $t->parse (\*IN);

    close OUT or die "Could not close $outfile";
    close IN or die "Could not close $infile";

    sub doSwap {
        my ($t, $figure) = @_;                        # required arguments for twig handlers
        my @title = $figure->cut_children ('title');  # extract (the only) <title> node from the <figure> node
        $title[0]->paste (last_child => $figure);     # make it the last child of that <figure> node
        $figure->print(\*OUT);                        # print the rearranged <figure> node
    }

    Peace,

    --Jack
      Below is an XSLT solution to my swapping problem. It was quite interesting that it took the XSLT engine I was using (MSXML4.0) over 21 seconds to generate an output file on my 3MB input (45K+ XML nodes). The Perl program I posted last week generated an output file in a couple of seconds.
      <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
        <!-- 12/01/2005 jkulas@lsijax.com -->
        <!-- This transforms figure by moving title to last_child position -->
        <!-- Functionally similar to twig_roots -->
        <xsl:template match ="figure">
          <xsl:element name="figure">
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates/>
            <!-- Add the title node as last child-->
            <xsl:if test="title">
              <xsl:element name ="title">
                <xsl:copy-of select="title/@*"/>
                <xsl:copy-of select="title/text()"/>
              </xsl:element>
            </xsl:if>
          </xsl:element>
        </xsl:template>

        <!-- Ignore any figure/title's since they are handled in the figure template-->
        <xsl:template match ="figure/title">
        </xsl:template>

        <!-- Copy all nodes, except figure nodes (of course)-->
        <!-- Adapted from Michael Kay, _XSLT 2.0_, 3/e, p. 243-->
        <!-- Functionally similar to twig_print_outside_roots -->
        <xsl:template match ="@* | node()">
          <xsl:copy>
            <xsl:copy-of select="@*"/>
            <xsl:apply-templates/>
          </xsl:copy>
        </xsl:template>
      </xsl:stylesheet>
      Peace,
      --Jack
Re: Swapping XML Elements
by mrborisguy (Hermit) on Nov 29, 2005 at 19:31 UTC

    Here's what I observe: There may or may not be a line break between the elements. However, your regex assumes that there is not. Somehow, you need to get the entire document into one variable, and run the substitutions on that variable, making sure you account for possible newlines in the variable.
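
    To make that concrete: with -p, Perl runs the substitution on one line at a time, so a pattern that needs <figure>, <title> and <graphic> together can never match when they sit on separate lines. Below is a minimal sketch of the slurp-and-substitute idea, not a robust XML solution. The helper name swap_title_and_graphic is made up for illustration, and the pattern assumes that each <title> holds plain text only and is immediately followed (apart from whitespace) by a self-closing <graphic>.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes the whole document as
# one string and moves each <title> after the <graphic> that follows it.
sub swap_title_and_graphic {
    my ($xml) = @_;
    $xml =~ s{
        (<figure\b[^>]*>\s*)          # group 1: opening figure tag and trailing whitespace
        (<title\b[^>]*>[^<]*</title>) # group 2: title element, text content only (assumption)
        (\s*)                         # group 3: whitespace or newlines between the two
        (<graphic\b[^>]*/>)           # group 4: self-closing graphic element
    }{$1$4$3$2}gx;
    return $xml;
}

# One figure from the question, held in a single string so that the
# newlines between elements are visible to the pattern.
my $sample = <<'XML';
<figure airframe="PLATFORM-1" id="aircraftupdimensions" span="1" tocentry="1">
<title>Aircraft Up Dimensions</title>
<graphic fileref="MH60S_Figures\Chapter_1\aircraftupdimensions.jpg" />
</figure>
XML

print swap_title_and_graphic($sample);
```

    For a real file, the contents would first have to be read in one piece, for example with an undefined $/ (as in "local $/;") or with the -0777 command line switch. Anything the assumptions above do not cover, such as markup inside a title or a figure without a graphic, is left untouched or may be mishandled, which is why the parser-based replies are the safer route.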

        -Bryan