Sporti69 has asked for the wisdom of the Perl Monks concerning the following question:

Hi y'all. Being a newbie, I get lost among the loads of XML modules. Here is my question/problem: I need an easy way to read an XML file. No matter the structure, the module/code/script should read the XML into memory as a nice nested variable (e.g. an array of hashes, or something else depending on the structure of the XML). So, a universal Perl XML interpreter giving me a variable that has everything stored in there, nice and structured... Help is needed :) Kindest regards, Tom (and his first post on PerlMonks ^^)

Replies are listed 'Best First'.
Re: XML Module
by psini (Deacon) on May 31, 2008 at 14:07 UTC
Re: XML Module
by Your Mother (Archbishop) on May 31, 2008 at 17:21 UTC

    I second what psini says. XML::Simple considered harmful. This is the normal lifecycle with it:

    1. You need an XML module... Let's check CPAN.
    2. Wow, XML::Parser, XML::LibXML, XML::Compile, XML::Twig... Yikes. This could take hours to learn one of these. Oh, hey! XML::Simple.
    3. That worked great!
    4. Oh, but you need it a little different. So just read the XML::Simple docs. It should be simple to change.
    5. ...Hours go by...
    6. ...Things are thrown...
    7. ...Questions are asked at SOPW...
    8. Someone mentions several older threads and trying a search first next time.
    9. You end up getting one of the fully featured modules.
    10. You spend about as much time learning it as you already flushed fighting XML::Simple over whether your data are elements or attributes.
    11. Now you have a real skill with a powerful tool and if you picked something with a DOM standards interface like LibXML, you also just picked up a bunch of transferable skills like hacking JS.

    For my own part, I find XML::LibXML, XML::Compile, XML::Twig to be the sweet spots.
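
    To give a feel for the DOM-style interface mentioned above, here is a minimal sketch using XML::LibXML. The sample document and the variable names are made up for illustration; `load_xml`, `findnodes`, `findvalue`, `getElementsByTagName` and `textContent` are regular XML::LibXML methods.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample document, used only for this illustration.
my $xml = q{<root>
  <tag><name>foo</name><value>475</value></tag>
  <tag><name>bar</name><value>147</value></tag>
</root>};

my $doc = XML::LibXML->load_xml(string => $xml);

# XPath access: collect name => value pairs from every <tag>.
my %value_for;
for my $tag ($doc->findnodes('/root/tag')) {
    $value_for{ $tag->findvalue('name') } = $tag->findvalue('value');
}

# DOM-standard access: the same method names exist in browser JavaScript.
my ($first_name) = $doc->getElementsByTagName('name');

print "$_ = $value_for{$_}\n" for sort keys %value_for;
print "first name: ", $first_name->textContent, "\n";
```

    For a file on disk, `load_xml(location => $path)` would be used instead of `string => ...`.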

Re: XML Module
by ides (Deacon) on May 31, 2008 at 15:04 UTC

    The easiest way to do that is to use XML::Simple. However, it isn't the fastest module if your XML documents are really large and you only need to work with certain pieces of them. For cases like that I'd suggest using XML::Twig or the previously mentioned XML::Parser.
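
    A rough sketch of the "only certain pieces" approach with XML::Twig: a handler fires once per matching element, and the processed part of the tree is then purged so memory stays flat. The sample XML, the handler body and `%value_for` are assumptions for illustration only.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample; a real large document would be read with parsefile().
my $xml = q{<root>
  <tag><name>foo</name><value>475</value></tag>
  <tag><name>bar</name><value>147</value></tag>
</root>};

my %value_for;
my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <tag> element has been parsed.
        tag => sub {
            my ($t, $elt) = @_;
            $value_for{ $elt->first_child_text('name') }
                = $elt->first_child_text('value');
            $t->purge;    # discard what has been parsed so far
        },
    },
);
$twig->parse($xml);

print "$_ = $value_for{$_}\n" for sort keys %value_for;
```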

    Hope that helps!

    Frank Wiles <frank@revsys.com>
    www.revsys.com

      I myself dislike XML::Simple: not only is it not very fast, it also has some questionable default behaviours, like the infamous KeyAttr option (in the last week alone, no fewer than 3 questions here in SoPW were related to this "feature").

      I believe that simple should mean "simple", not "simple if you want to do things my way only".

      Careful with that hash Eugene.

Re: XML Module
by Jenda (Abbot) on Jun 01, 2008 at 13:39 UTC

    Looks like XML::Simple gets a lot of bad press here. I guess it stems from the fact that most people fail to read the docs, even though they are short. Let's see what the problems with XML::Simple are.

    First, the data structure it produces is not always consistent. Eg. for XML like this:

    <root>
      <tag>
        <sub>foo</sub>
      </tag>
      <tag>
        <sub>bar</sub>
        <sub>baz</sub>
      </tag>
    </root>
    the <sub> is once converted to a scalar and second time to an array of scalars. Big deal! Here comes ForceArray=>[qw(list of tags that may be repeated)].
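
    A small sketch of that inconsistency and the ForceArray fix, using the XML above. The variable names are illustrative; KeyAttr => [] is passed only to keep key folding out of the picture.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

my $xml = q{<root>
  <tag><sub>foo</sub></tag>
  <tag><sub>bar</sub><sub>baz</sub></tag>
</root>};

# Defaults: a lone <sub> becomes a scalar, repeated <sub>s an array ref.
my $loose = XMLin($xml, KeyAttr => []);

# ForceArray with a list of tag names: those tags always become array refs.
my $strict = XMLin($xml, KeyAttr => [], ForceArray => [qw(tag sub)]);

# With ForceArray the same loop works for every <tag>.
for my $tag (@{ $strict->{tag} }) {
    print join(', ', @{ $tag->{sub} }), "\n";
}
```

    Listing tag names rather than passing ForceArray => 1 keeps single-valued tags as plain scalars.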

    The next problem is that it's a bit too aggressive in trying to help you, transforming

    <root>
      <tag>
        <name>foo</name>
        <value>475</value>
      </tag>
      <tag>
        <name>bar</name>
        <value>147</value>
      </tag>
    </root>
    to
    {
      'tag' => {
        'bar' => { 'value' => '147' },
        'foo' => { 'value' => '475' }
      }
    }
    Again, huge deal, READ THE DOCS and set KeyAttr => [] or to whatever list of tags/attributes you do want to fold on.
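
    For comparison, a minimal sketch of the same document parsed with the default KeyAttr and with folding switched off (variable names are illustrative):

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

my $xml = q{<root>
  <tag><name>foo</name><value>475</value></tag>
  <tag><name>bar</name><value>147</value></tag>
</root>};

# Default KeyAttr folds the list of <tag>s into a hash keyed on <name>.
my $folded = XMLin($xml);

# KeyAttr => [] disables folding: <tag> stays an array of hashes.
my $plain = XMLin($xml, KeyAttr => []);

print $folded->{tag}{foo}{value}, "\n";    # 475
print $plain->{tag}[0]{name}, "\n";        # foo
```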

    There is one problem, though, that has not been adequately handled in XML::Simple yet: the inconsistency of

    <root>
      <tag>content only</tag>
      <tag attr="1">and content</tag>
    </root>
    If you have a tag that has only optional attributes, and it sometimes has them and sometimes doesn't, it's harder than necessary to get at the content. You have to use ref() to see whether the <tag> produced a scalar or a hashref. There is an option that can force XML::Simple to always produce the hashref, but it applies to all tags, not just those few that it makes sense for. It's not actually that hard to implement this so that it supports the same kind of settings as ForceArray. I just did that and will send a patch to the module maintainer shortly.
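
    A sketch of both workarounds as they stand in released XML::Simple: the ref() check, and the global ForceContent => 1 switch. The per-tag form of ForceContent described in the patch is not shown, since it is not assumed to be available. Variable names are illustrative.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

my $xml = q{<root>
  <tag>content only</tag>
  <tag attr="1">and content</tag>
</root>};

# Default: the first <tag> is a plain scalar, the second a hashref
# with the text stored under the 'content' key.
my $data = XMLin($xml, KeyAttr => [], ForceArray => ['tag']);
my @texts = map { ref $_ ? $_->{content} : $_ } @{ $data->{tag} };

# ForceContent => 1 makes every element a hashref, for all tags.
my $forced = XMLin($xml, KeyAttr => [], ForceArray => ['tag'],
                   ForceContent => 1);
my @forced_texts = map { $_->{content} } @{ $forced->{tag} };

print "$_\n" for @texts;
```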

    So all you have to do to get a nice, clean, consistent, minimal data structure out of the XML is to set ForceArray, KeyAttr and ForceContent accordingly. Big deal.

    Besides, you can infer which tags need ForceArray and ForceContent from example XMLs, the DTD or the Schema. I already have the inference from example XMLs done for my XML::Rules, and it's trivial to change it to produce the options in the XML::Simple format. The upcoming version of XML::Rules will contain functions for inferring these options from examples and DTDs for both modules.

    P.S.: Sporti69, you may of course consider using my XML::Rules instead. With a little more work it can give you a more streamlined, or even filtered and tweaked, data structure, and it lets you process the XML in chunks instead of loading everything into memory before you get a chance to process anything.

    P.P.S.: I did not discuss one "problem" of XML::Simple: it doesn't preserve the order of child tags. When was the last time you needed that when extracting data from a data-oriented XML? That information would just waste memory and possibly complicate access in such applications. Of course, it means that XML::Simple is not suited for document-oriented XML, or for modifying XML that's supposed to be consumed by a stricter application. If you need that, use a different module.