
Accessing mixed content in XML

by anthski (Scribe)
on Aug 09, 2005 at 22:42 UTC ( #482456=perlquestion )

anthski has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I'm trying to use an XML based configuration file for a script and have hit a (common, I assume) problem where an element contains mixed content.

For example, a relevant snippet is:

<function name="showImage">
  <data>A random picture</data>
  <argument><img src="test.jpg"></argument>
  <argument>0</argument>
</function>

The <img src="..."> tag here is a simplified example of an argument which may have more than one HTML style tag included as an attribute for the argument element.

The point is that I want to be able to tell my XML parser that anything contained within the <argument></argument> element should /always/ be treated as a single attribute, because sometimes it may contain HTML tags, and sometimes it may not.

I initially tried using XML::Simple to slurp the config file in as a hash but it doesn't support mixed content, so I've moved onto XML::DOM which boasts support for this, but for which I find the documentation somewhat confusing/unclear.

If I throw the following snippet of code at the aforementioned xml file

#!/usr/bin/perl -w
use strict;
use XML::DOM;

my $parser = new XML::DOM::Parser;
my $doc    = $parser->parsefile("config.xml");
my $config = $doc->getDocumentElement;

my @functions = $config->getElementsByTagName('function');
foreach my $function (@functions) {
    my @arguments = $function->getElementsByTagName('argument');
    foreach my $argument (@arguments) {
        my @argumentValue = $argument->getFirstChild->getData;
        print "argument: @argumentValue\n";
    }
}

Then I end up with the error:

mismatched tag at line 3, column 36, byte 98 at /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/XML/ line 185

I assume that lots of people have at some stage wanted to include html tags inside an xml file, and not wanted their parser to try to offer it as a separate element with attributes. I might be wrong!

Any advice on what I can do would be appreciated.

Replies are listed 'Best First'.
Re: Accessing mixed content in XML
by samtregar (Abbot) on Aug 09, 2005 at 22:52 UTC
    I gather this isn't what you want to hear, but you'll have to escape that HTML to get it into XML. For example:

    <argument>&lt;img src="test.jpg"&gt;</argument>

    If you don't then you don't have valid XML and no XML parser I've ever seen is going to parse the file successfully.
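A minimal sketch of the escaping described above, using only core Perl. The helper name `escape_xml_text` is hypothetical and not part of any module. It escapes the three characters that matter in element content, handling `&` first so that entities produced by the later substitutions are not escaped a second time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape a string for use as XML element content.
sub escape_xml_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first, or it would re-escape &lt; and &gt;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $html = '<img src="test.jpg">';
print '<argument>', escape_xml_text($html), "</argument>\n";
```

A conforming parser turns the entities back into `<` and `>` when the text is read, so the script reading the config should see the original HTML string.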



      I'm quite happy to hear that suggestion as it's easily implemented and works perfectly for what I want to do.

      Thanks for your help. Very much appreciated.

        Sure, that works fine for the tiny example shown, but it won't work in the arbitrary case. HTML cannot be trivially transformed into XML in many cases.


Re: Accessing mixed content in XML
by izut (Chaplain) on Aug 10, 2005 at 02:34 UTC
    Your XML should be like this:
    <function name="showImage">
      <data>A random picture</data>
      <argument><![CDATA[<img src="test.jpg">]]></argument>
      <argument>0</argument>
    </function>
    Now the perl code:
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Simple;
    use Data::Dumper;

    my $ref = XMLin(\*DATA);
    print Dumper $ref;

    __DATA__
    <function name="showImage">
      <data>A random picture</data>
      <argument><![CDATA[<img src="test.jpg">]]></argument>
      <argument>0</argument>
    </function>
    The results:
    $VAR1 = {
              'argument' => [
                              '<img src="test.jpg">',
                              '0'
                            ],
              'name' => 'showImage',
              'data' => 'A random picture'
            };

    You can read more about the XML specs here or by searching Google.

    Update: If you use XML::Simple to create the XML file, it will automagically convert ">" and "<" to &gt; and &lt;.
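If the config file is written by hand or by a script rather than through XML::Simple, the CDATA wrapping can be sketched in core Perl as follows. The helper name `cdata_wrap` is hypothetical. The one case that needs care is a literal `]]>` inside the text, which would end the section early, so the sketch splits it across two adjacent CDATA sections.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap a string in a CDATA section.
sub cdata_wrap {
    my ($text) = @_;
    # A literal "]]>" would close the section, so end the section after "]]"
    # and start a new one that begins with ">".
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

print '<argument>', cdata_wrap('<img src="test.jpg">'), "</argument>\n";
```

Compared with entity escaping, CDATA keeps the HTML readable in the file, at the cost of this one special case.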

    Igor S. Lopes - izut
    surrender to perl. your code, your rules.

      Thanks very much for your well explained solution, example code and the links. I'll do some reading about XML.

      If I'd known about CDATA and XML::Simple supporting it, I'd have probably stuck with XML::Simple. For now, using &gt; and &lt; is working nicely with XML::DOM.

      Thanks again.

        Instead of XML::DOM you might want to have a look at XML::LibXML, which gives you a lot more than XML::DOM: XPath (very, VERY useful), RelaxNG, XInclude, an HTML parser, better performance... As it implements the DOM, porting code from XML::DOM to XML::LibXML is also very easy (usually all you have to do is change the names of a few constants).
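As a rough illustration of the XPath support mentioned above, here is a small sketch that reads the argument values with XML::LibXML. It assumes the module is installed and that the snippet from this thread is the whole document, with `<function>` as the root element; the real config file presumably has an enclosing element, in which case the XPath expression would need adjusting.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Assumption: the snippet from the thread is the entire document.
my $xml = <<'XML';
<function name="showImage">
  <data>A random picture</data>
  <argument><![CDATA[<img src="test.jpg">]]></argument>
  <argument>0</argument>
</function>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# XPath selects every <argument> child of the root <function>;
# textContent returns the text of each, CDATA included.
my @arguments = map { $_->textContent } $doc->findnodes('/function/argument');
print "argument: $_\n" for @arguments;
```

The same loop should work unchanged if the HTML is entity-escaped instead of wrapped in CDATA, since `textContent` returns the decoded text in both cases.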

Node Type: perlquestion [id://482456]
Approved by samtregar