PerlMonks  

XML gurus unite!!

by jmmistrot (Sexton)
on Mar 05, 2007 at 08:49 UTC ( [id://603174]=perlquestion )

jmmistrot has asked for the wisdom of the Perl Monks concerning the following question:

I am new to XML parsing and writing, so be gentle... :) I have the following bit of XML that I want to read in and spit back out in exactly the same form but with different values:
    <?xml version="1.0" encoding="utf-8"?>
    <AttributeDatabase>
      <attribclass toolonly="false" hidden="false" dbvault="true">
        <editlist name="default">
          <entry>*</entry>
        </editlist>
        <preferrededitlist>"default"</preferrededitlist>
        <attribdef name="mtl_param" synthetic="true" array="true">
          <desc>IsSynthetic</desc>
          <type>Attrib.Types.Vector4</type>
          <bound>True</bound>
          <properties platform="common">
            <mincount>2</mincount>
            <maxcount>2</maxcount>
          </properties>
        </attribdef>
        <attribdef name="mtl_Kdiff" toolonly="true">
          <memberkeyvalue path="" key="push(bits)-&gt;::mtl_param:common:0:w" />
          <desc>controls --&gt; mtl_param[0].w </desc>
          <type>EA.Reflection.Float</type>
          <bound>False</bound>
          <increment>0.001</increment>
          <properties platform="common">
            <max>2.0</max>
            <min>0.0</min>
            <defaultreflectedobject>1.0</defaultreflectedobject>
          </properties>
        </attribdef>
      </attribclass>
    </AttributeDatabase>
I have no problem reading things in with XML-Simple
    my $fh;
    open( $fh, "<", $arg{file} ) or die;
    my $xml      = XML::Simple->new( KeepRoot => 1 );
    my $xml_data = $xml->XMLin($fh);
I do stuff to it... and then go to write it out thus:
    my $xml      = XML::Simple->new( XMLDecl => 1, KeepRoot => 1 );
    my $xml_data = $xml->XMLout( $arg{data} );
but no matter which options I try with XML-Simple I can't reconstruct the same XML output as input. The closest I have come is:
    <?xml version='1.0' standalone='yes'?>
    <AttributeDatabase>
      <attribclass>
        <attribdef>
          <name>mtl_Kdiff</name>
          <bound>False</bound>
          <desc>controls --&gt; mtl_param[0].w </desc>
          <increment>0.001</increment>
          <memberkeyvalue>
            <key>push(bits)-&gt;::mtl_param:common:0:w</key>
            <path></path>
          </memberkeyvalue>
          <properties>
            <defaultreflectedobject>1.0</defaultreflectedobject>
            <max>2.0</max>
            <min>0.0</min>
            <platform>common</platform>
          </properties>
          <toolonly>true</toolonly>
          <type>EA.Reflection.Float</type>
        </attribdef>
        <attribdef>
          <name>mtl_param</name>
          <array>true</array>
          <bound>True</bound>
          <desc>IsSynthetic</desc>
          <properties>
            <maxcount>2</maxcount>
            <mincount>2</mincount>
            <platform>common</platform>
          </properties>
          <synthetic>true</synthetic>
          <type>Attrib.Types.Vector4</type>
        </attribdef>
        <dbvault>true</dbvault>
        <editlist>
          <name>default</name>
          <entry>*</entry>
        </editlist>
        <hidden>false</hidden>
        <preferrededitlist>&quot;default&quot;</preferrededitlist>
        <toolonly>false</toolonly>
      </attribclass>
    </AttributeDatabase>
Notice how the attributes embedded in the attribclass and attribdef tags drop out of the tags and become child elements, and my quotes (") become &quot; or drop out altogether... Does anyone know how to keep some attributes inside the element tags while letting others go? I am afraid I am going to have to write out each element myself as I traverse the data structure so that I can control the output of XMLout(). Anyone got any advice?

Your humble DTD Destroyer, jmm

Replies are listed 'Best First'.
Re: XML gurus unite!!
by Corion (Patriarch) on Mar 05, 2007 at 08:57 UTC

    I don't know XML::Simple much, but the conversion of

    <preferrededitlist>"default"</preferrededitlist>
    to
    <preferrededitlist>&quot;default&quot;</preferrededitlist>

    looks perfectly valid to me, except that your consumer of the output likely doesn't know what to do with that. Maybe you should use a templating module like Template::Toolkit instead. Another idea could be to try other XML modules, maybe the less compliant XML::Tiny or some other tagsoup parser.

Re: XML gurus unite!!
by varian (Chaplain) on Mar 05, 2007 at 09:13 UTC
    If you use a combination of ForceArray and an empty KeyAttr then the embedded attributes will be captured nicely:
        # translate xml text to hash ref
        $xmlparms = eval {
            XML::Simple::XMLin(
                $xmltext,
                nsexpand   => 1,
                ForceArray => 1,
                KeepRoot   => 1,
                KeyAttr    => [],
            )
        };
    I cannot reproduce your quotes problem; XML::Simple reads the quotes as-is into the hash. Perhaps they are converted on the input stream before XML::Simple gets involved?

    Update:
    Actually, the &quot; entities are generated by default by XML::Simple on output. You can turn off this behavior by specifying the option:

    NoEscape=>1
    Doing so you run the risk of introducing fake tags into your xml stream, e.g. if your text includes a '<' character.
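
A minimal sketch of the round trip described in this reply, assuming XML::Simple and one of the parsers it relies on are installed. The sample document is a trimmed copy of the one in the question, and the new increment value 0.002 is only an illustration. The same three options are given to both XMLin and XMLout, so scalars are written back as attributes and arrays as child elements. XMLout sorts attributes and elements alphabetically by default, so the order will generally not match the input.

```perl
use strict;
use warnings;
use XML::Simple;

# Trimmed copy of the XML from the question.
my $xml_text = <<'XML';
<AttributeDatabase>
  <attribclass toolonly="false" hidden="false">
    <preferrededitlist>"default"</preferrededitlist>
    <attribdef name="mtl_Kdiff" toolonly="true">
      <type>EA.Reflection.Float</type>
      <increment>0.001</increment>
    </attribdef>
  </attribclass>
</AttributeDatabase>
XML

# Options set here apply to both XMLin and XMLout on this object.
my $xs   = XML::Simple->new( KeepRoot => 1, ForceArray => 1, KeyAttr => [] );
my $data = $xs->XMLin($xml_text);

# Walk down to the element to change. The root may or may not be
# wrapped in an array, so handle both shapes.
my $root = $data->{AttributeDatabase};
$root = $root->[0] if ref $root eq 'ARRAY';
my $def = $root->{attribclass}[0]{attribdef}[0];
$def->{increment}[0] = '0.002';    # illustrative new value

my $out = $xs->XMLout( $data,
    XMLDecl => '<?xml version="1.0" encoding="utf-8"?>' );
print $out;
```

The quotes in preferrededitlist still come out as &quot; here, which is equivalent XML; NoEscape => 1 would suppress that, with the risk noted above.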
Re: XML gurus unite!!
by merlyn (Sage) on Mar 05, 2007 at 14:59 UTC
    Get a better (faster) parser, such as XML::LibXML. Using XML::XSH2, I loaded and saved that file, and the result was:
        <?xml version="1.0" encoding="utf-8"?>
        <AttributeDatabase>
          <attribclass toolonly="false" hidden="false" dbvault="true">
            <editlist name="default">
              <entry>*</entry>
            </editlist>
            <preferrededitlist>"default"</preferrededitlist>
            <attribdef name="mtl_param" synthetic="true" array="true">
              <desc>IsSynthetic</desc>
              <type>Attrib.Types.Vector4</type>
              <bound>True</bound>
              <properties platform="common">
                <mincount>2</mincount>
                <maxcount>2</maxcount>
              </properties>
            </attribdef>
            <attribdef name="mtl_Kdiff" toolonly="true">
              <memberkeyvalue path="" key="push(bits)-&gt;::mtl_param:common:0:w"/>
              <desc>controls --&gt; mtl_param[0].w </desc>
              <type>EA.Reflection.Float</type>
              <bound>False</bound>
              <increment>0.001</increment>
              <properties platform="common">
                <max>2.0</max>
                <min>0.0</min>
                <defaultreflectedobject>1.0</defaultreflectedobject>
              </properties>
            </attribdef>
          </attribclass>
        </AttributeDatabase>
    And using the XSH2 language, I could have modified the tree with a mixture of XPath loops and Perl expressions. It's quite elegant, and even compiles down to Pure Perl code.

    I have an article on using xsh2 to scrape HTML as well (embargoed at the moment, should go live in a few weeks).
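
For readers who want to try the XML::LibXML route directly from Perl rather than through XSH2, here is a rough sketch. It assumes XML::LibXML 1.70 or later (for load_xml) and uses a trimmed copy of the document from the question; the XPath expression and the new value 0.002 are illustrative choices, not something from this reply.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the XML from the question.
my $xml_text = <<'XML';
<?xml version="1.0" encoding="utf-8"?>
<AttributeDatabase>
  <attribclass toolonly="false" hidden="false" dbvault="true">
    <preferrededitlist>"default"</preferrededitlist>
    <attribdef name="mtl_Kdiff" toolonly="true">
      <increment>0.001</increment>
    </attribdef>
  </attribclass>
</AttributeDatabase>
XML

my $doc = XML::LibXML->load_xml( string => $xml_text );

# Find the element with XPath and replace its text content.
for my $node ( $doc->findnodes('//attribdef[@name="mtl_Kdiff"]/increment') ) {
    $node->removeChildNodes;
    $node->appendText('0.002');    # illustrative new value
}

# Serialize the tree; attributes and untouched text stay as they were.
my $out = $doc->toString;
print $out;
```

Because the document is kept as a tree rather than converted to hashes, attribute order and the literal quotes in preferrededitlist should survive the round trip.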

Re: XML gurus unite!!
by Herkum (Parson) on Mar 05, 2007 at 11:48 UTC

    Don't use XML::Simple, it is one of the most complicated 'Simple' modules that I have ever come across.

    XML::Twig is better for XML documents that have any sort of complexity.

      I agree, XML::Twig is great, although its API is humongous. It takes a while to find out what the correct name of the method is that you need, and they are not even sorted in any sensible way in the documentation.

      XML::Twig also adheres to TIMTOWTDI. I can see two immediate ways to achieve your goal:

      1. Read in the whole XML file with

        XML::Twig->new()->parsefile('my xml file');

        then access individual nodes and change their values. For example:

            my $twig = XML::Twig->new()->parsefile('my xml file');

            # Use XPath-like expressions to find the nodes you want.
            my @nodes = $twig->get_xpath('attribclass/attribdef[@name="mtl_Kdiff"]');
            for my $node (@nodes) {
                # Process...
                # Example:
                $node->set_att('synthetic', 'false');
            }

            # Or navigate through references.
            my @attribdefs = $twig->root->first_child('attribclass')->children('attribdef');
            for my $node (@attribdefs) {
                # Process...
            }

      2. Create an XML filter:

            my $twig = XML::Twig->new(
                twig_handlers => {
                    'attribdef' => sub {
                        my ($twig, $elt) = @_;
                        if ($elt->att('name') eq 'mtl_param') {
                            # Do something.
                        }
                        # elsif (...) { ... }   # etc.
                        $twig->flush;
                    }
                }
            )->parsefile('my xml file');
            $twig->flush;

        This will read in the file, parse it, and while parsing, call the twig handlers defined above. The handlers can do their stuff (change element names, change values, cut and paste subtrees, and all other cool things), and the final XML text will be written to STDOUT by $twig->flush.

        However, this approach is not very practical if you first need to read in some data from the file, do some processing with it, and only later go and update the file. You could first read the file in as a data structure (almost?) identical to what XML::Simple produces:

            my $hash = XML::Twig->new()->parsefile('my xml file')->simplify();
            use Data::Dumper;
            print Dumper($hash);

        You can then do what you like with the read values, and later use that information to construct an XML filter that will produce the final file.

      This sounds more complicated than it really is. On the other hand, XML is often too complicated for its own good.

      Warning: code examples not tested.

      --
      print "Just Another Perl Adept\n";

        I've been down this road before, and depending on what you're trying to do another answer is to use XSLT. Now if you're not familiar with it, it certainly takes getting used to. Even when I knew it I would write a program using XML::Twig. In hindsight I've found using a simple XSLT copy template, and then adding in the templates for the cases you need is best. Additionally, the nice thing about XSLT is that you can read ahead/behind of your current node in order to pull different values together. The next level is to realize you can plug in Java functions into the namespace and get any kind of programming done. Of course none of this takes into account your memory or processing speed requirements.
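
A sketch of the "copy template plus overrides" idea, driven from Perl. It assumes XML::LibXML and XML::LibXSLT are installed; the stylesheet, the element it overrides, and the new value 0.002 are illustrative choices and not taken from this reply. The first template copies every node and attribute unchanged, and the second, more specific one replaces a single text node.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# Identity (copy) template plus one override for the value to change.
my $xsl_text = <<'XSL';
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <xsl:template
      match="attribdef[@name='mtl_Kdiff']/increment/text()">0.002</xsl:template>
</xsl:stylesheet>
XSL

# Trimmed copy of the XML from the question.
my $xml_text = <<'XML';
<AttributeDatabase>
  <attribclass toolonly="false">
    <attribdef name="mtl_Kdiff" toolonly="true">
      <increment>0.001</increment>
    </attribdef>
  </attribclass>
</AttributeDatabase>
XML

my $source     = XML::LibXML->load_xml( string => $xml_text );
my $style_doc  = XML::LibXML->load_xml( string => $xsl_text );
my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style_doc);

# Apply the stylesheet and serialize the result as a character string.
my $result = $stylesheet->transform($source);
my $out    = $stylesheet->output_as_chars($result);
print $out;
```

Each further change becomes one more small template; everything not matched by an override passes through the copy template untouched.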
Re: XML gurus unite!!
by Moron (Curate) on Mar 05, 2007 at 12:29 UTC
    I'd agree that XML::Twig is purpose-built for reading in, "twigging" the data, and writing back out again.

    There is a monk here somewhere whose signature is something like, "Don't write your own XML parser." But of all the scores of languages I have ever seen, I have found XML to be absolutely the easiest to write a parser for. As far as your needs are concerned, the Backus Naur Form (BNF) is the shortest I can imagine for a language. Something like this would about cover it:

        Document     :== Heading [Tag ...]
        Heading      :== "<" [!">" ...] ">"
        Tag          :== "<" { TagName [Assignment ...] } ">" {Value|Substructure} "</" TagName ">"
        Value        :== [!"</" ...]
        Substructure :== [Tag ...]
        Assignment   :== name "=" QuotedString

    guaranteeing that the parser (which does nothing other than express the BNF in code form) should be as trivial as it gets.

    Of course, you also need a lexical analyser (about half a page in Perl) and a thrower to walk past whitespace and carriage returns, which can also poll the lexer rather than be written from scratch. You would also need to choose a structure that differentiates between simple value tags and tags that contain a substructure (e.g. tagname => { VALUE => scalar } versus tagname => { SUBTAGS => arrayReference }).

    The code generation is a mirror of the parser, reading through your data structure and generating the appropriate XML, and is thus equally trivial. You need to track the recursion depth of the puttag routine and indent each tag by " " x ($tabsize * ($depth - 1)), putting each tag on its own line.
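
A rough sketch of such a generator in core Perl. The node layout (NAME, ATTRS, VALUE, SUBTAGS) and the helper names escape_text, escape_attr and put_tag are assumptions made for this illustration; only the VALUE/SUBTAGS distinction and the indentation formula come from the description above. Attributes are kept in an array so their order is preserved, and quotes are escaped only inside attribute values.

```perl
use strict;
use warnings;

my $tabsize = 2;

# Escape characters that are special in element text.
sub escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Attribute values also need their double quotes escaped.
sub escape_attr {
    my $text = escape_text( $_[0] );
    $text =~ s/"/&quot;/g;
    return $text;
}

# Recursively turn one node into indented XML, one tag per line.
sub put_tag {
    my ( $node, $depth ) = @_;
    my $indent = ' ' x ( $tabsize * ( $depth - 1 ) );
    my $name   = $node->{NAME};
    my @attrs  = @{ $node->{ATTRS} || [] };
    my $open   = $name;
    while ( my ( $key, $value ) = splice( @attrs, 0, 2 ) ) {
        $open .= sprintf ' %s="%s"', $key, escape_attr($value);
    }
    if ( $node->{SUBTAGS} ) {
        return join '', "$indent<$open>\n",
            ( map { put_tag( $_, $depth + 1 ) } @{ $node->{SUBTAGS} } ),
            "$indent</$name>\n";
    }
    return "$indent<$open/>\n" unless defined $node->{VALUE};
    return "$indent<$open>" . escape_text( $node->{VALUE} ) . "</$name>\n";
}

my $tree = {
    NAME    => 'attribclass',
    ATTRS   => [ toolonly => 'false' ],
    SUBTAGS => [
        { NAME => 'preferrededitlist', VALUE => '"default"' },
        {
            NAME    => 'attribdef',
            ATTRS   => [ name => 'mtl_Kdiff', toolonly => 'true' ],
            SUBTAGS => [
                { NAME => 'desc', VALUE => 'controls --> mtl_param[0].w' },
            ],
        },
    ],
};

my $out = put_tag( $tree, 1 );
print $out;
```

This handles only the subset of XML in the grammar above (no comments, CDATA, namespaces or mixed content), which is the trade-off of rolling your own.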

    But the case for writing your own parser actually depends on whether or not you have a continuing need to meet new requirements that you cannot predict in advance (in my case there are multiple streams of XML to and from different organisations that have to be addressed for a single system) and cannot therefore nail your colours to any particular module that might already be available.

    -M

    Free your mind

      I don't agree on writing a custom XML parser, but if that's your poison, I recommend Parse::RecDescent. It's simply superb.

      --
      print "Just Another Perl Adept\n";

        Writing parsers from scratch is like cleaning the bathroom. It's a dirty job, but someone has to do it and once you do it once, it becomes part of your daily routine.

        -M

        Free your mind

Re: XML gurus unite!!
by dmitri (Priest) on Mar 05, 2007 at 20:55 UTC
    I had a similar problem, mucked with XML::Simple, XML::Writer and gave up... Here's what I did: use HTML::Template to create the XML and then use XML::Tidy to pretty-print it. Works like a charm, and you are in full control of attributes, escaping, and everything else.
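
A minimal sketch of the templating half of this approach, assuming HTML::Template is installed; the XML::Tidy pretty-printing step is not shown. The template text and the parameter names (TOOLONLY, EDITLIST, ATTRIBDEFS, DEF_NAME, INCREMENT) are made up for the illustration. ESCAPE=HTML is applied where escaping is wanted and left off where the literal quotes should pass through.

```perl
use strict;
use warnings;
use HTML::Template;

# The XML layout lives in the template, so attributes stay attributes.
my $template_text = <<'TMPL';
<?xml version="1.0" encoding="utf-8"?>
<AttributeDatabase>
<attribclass toolonly="<TMPL_VAR NAME=TOOLONLY ESCAPE=HTML>">
<preferrededitlist><TMPL_VAR NAME=EDITLIST></preferrededitlist>
<TMPL_LOOP NAME=ATTRIBDEFS>
<attribdef name="<TMPL_VAR NAME=DEF_NAME ESCAPE=HTML>">
<increment><TMPL_VAR NAME=INCREMENT ESCAPE=HTML></increment>
</attribdef>
</TMPL_LOOP>
</attribclass>
</AttributeDatabase>
TMPL

my $template = HTML::Template->new( scalarref => \$template_text );

# Fill in the values; each hash in ATTRIBDEFS is one pass of the loop.
$template->param(
    TOOLONLY   => 'false',
    EDITLIST   => '"default"',
    ATTRIBDEFS => [
        { DEF_NAME => 'mtl_Kdiff', INCREMENT => '0.001' },
    ],
);

my $out = $template->output;
print $out;
```

Any value inserted without ESCAPE=HTML must already be safe to embed, since a stray '<' or '&' would make the output ill-formed.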
Re: XML gurus unite!!
by gasho (Beadle) on Mar 06, 2007 at 18:01 UTC
    Try to use XML::Twig; http://www.xmltwig.com/xmltwig/tutorial/index.html
    (: Life is short enjoy it :)

Node Type: perlquestion [id://603174]
Approved by Corion
Front-paged by Corion