ninja-joe has asked for the wisdom of the Perl Monks concerning the following question:

Hiya all! Been a while since I've posted. I'm heading to school in a couple of weeks and I was considering writing a note taking program of sorts.

The idea was to be able to write an XML parser that would allow me to nest sections in the notes and do term definitions (among other things superfluous to this post) and it would all be eventually parsed into an eye friendly HTML file. XML sample:
<section id="chap1">
  <section id="part1">infoinfostuffhere
    <def word="someword">some definition of the word</def>
    more info and notes
  </section>
  <section id="part2">
    etc and so on, possibly nesting section tabs
  </section>
</section>
All is good and well until it comes time to pick the module to work with. I automatically wanted to jump for XML::Parser but I've heard there are limitations... Are these built in? I checked the CPAN documentation and I couldn't find any mention/numbers about it. Are there formal limitations of XML::Parser or simply ones that come out of poor performance with large files?

I was considering XML::Twig or XML::DOM in no particular order for alternates.

Any suggestions?
Thanks

Replies are listed 'Best First'.
(jeffa) Re: Picking an XML Module
by jeffa (Bishop) on Aug 03, 2003 at 15:49 UTC
    This sounds like a good candidate for XML::LibXML and XML::LibXSLT.

    UPDATE:
    First off, i really think you should change the structure of your XML to something like: def.xml

    <book>
      <chapter id="1">
        <part id="1">
          <info>infoinfostuffhere</info>
          <def word="someword">some definition of the word</def>
          <extra>more info and notes</extra>
        </part>
        <part id="2">
          <info>infoinfostuffhere</info>
          <def word="someword">some definition of the word</def>
          <extra>more info and notes</extra>
        </part>
      </chapter>
    </book>
    This will make parsing much easier ... having a bunch of <section id="foo"> tags is just too general. Also, be sure and wrap everything you can. Now that we have some, IMHO, better XML to work with, we can define a stylesheet: def.xsl
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/book">
        <xsl:for-each select="chapter[@id]">
          <h1> Chapter <xsl:value-of select="@id"/> </h1>
          <xsl:for-each select="part[@id]">
            <h3> Part <xsl:value-of select="@id"/> </h3>
            <i><xsl:value-of select="info"/></i><br/>
            <xsl:for-each select="def[@word]">
              <b><u><xsl:value-of select="@word"/></u></b>:<br/>
            </xsl:for-each>
            <xsl:value-of select="def"/><br/>
            <xsl:value-of select="extra"/><br/>
          </xsl:for-each>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>
    And finally, the script to transform all of this into 'HTML'
    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXSLT;

    my $xml  = XML::LibXML->new();
    my $xslt = XML::LibXSLT->new();

    # parse the notes and the stylesheet (the stylesheet is just another XML document)
    my $source    = $xml->parse_file('def.xml');
    my $style_doc = $xml->parse_file('def.xsl');

    # compile the stylesheet, apply it, and print the serialized result
    my $stylesheet = $xslt->parse_stylesheet($style_doc);
    my $results    = $stylesheet->transform($source);
    print $stylesheet->output_string($results);

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      First off, i really think you should change the structure of your XML to something like: def.xml

      Why? Why would you pervert a perfectly logical, not to mention practical, document structure, in order for your code to be easier to write? The original format makes sense, why add extra tags everywhere to avoid having to deal with mixed content? Mixed content exists, it's there for a good reason: that's how you write documents.

      What happens if you have more than one definition in the section? Would you have this:

      <part id="1">
        <info>infoinfostuffhere</info>
        <def word="someword">some definition of the word</def>
        <extra>more info and notes</extra>
        <extradef word="someword">some definition of an other word</extradef>
        <doubleextra>even more info and notes</doubleextra>
      </part>

      I don't think it would make sense either!

        Because i am still a newbie at XML. :P

        Seriously, because i didn't know any better ... i see now why the mixed content is OK to have. ninja-joe ... my apologies. If you do take my advice, feel free to ask more questions ... i personally find XSLT and XPath to be somewhat hard to work with until you get the hang of them. While i was hacking out the above code, i found it 'easier' (falsely, of course) to wrap everything instead of dealing with mixed content. mirod++ yet again. ;)

        jeffa

        
Re: Picking an XML Module
by CountZero (Bishop) on Aug 03, 2003 at 15:51 UTC

    Have a look at "So many ways to Rome", an interesting article listing the pros and cons of the various Perl XML modules. It was presented at the YAPC:EU in Paris and I found it very enlightening.

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Picking an XML Module
by mirod (Canon) on Aug 03, 2003 at 16:07 UTC

    I don't know what limitations of XML::Parser you are referring to. In any case it is a low-level module, and you should use higher-level ones. XML::DOM is also not a good choice IMHO, the DOM being another low-level standard that does not match what we expect from a high-level language like Perl.

    XML::Twig (surprise!) and XML::LibXML are the ones I usually recommend.
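To make the XML::Twig suggestion concrete for the original mixed-content markup, here is a minimal sketch. The helper name notes_to_html and the choice of <div> and <i> as output tags are assumptions made for illustration; they are not from the thread. It relies only on twig_handlers, att, del_att, set_tag, prefix, parse and sprint.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: rewrite the note markup into HTML-like markup in place.
sub notes_to_html {
    my ($xml) = @_;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # A handler fires once its element has been completely parsed.
            def => sub {
                my ($t, $def) = @_;
                my $word = $def->att('word');
                $def->del_att('word');
                $def->set_tag('i');         # <def> becomes <i>
                $def->prefix("$word: ");    # term goes in front of its definition
            },
            section => sub {
                my ($t, $section) = @_;
                $section->set_tag('div');   # <section id="..."> becomes <div id="...">
            },
        },
    );

    $twig->parse($xml);
    return $twig->sprint;                   # serialize the modified tree
}

my $notes = <<'XML';
<section id="chap1"><section id="part1">infoinfostuffhere <def word="someword">some definition of the word</def> more info and notes</section></section>
XML

print notes_to_html($notes), "\n";
```

The text around the def element is left alone, so the mixed content survives without any extra wrapper tags.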

Re: Picking an XML Module
by liz (Monsignor) on Aug 03, 2003 at 15:36 UTC
    Before you start doing your notes in XML, maybe you should have a look at YAML. And if you still want XML in the end, you can generate that out of the YAML without too much trouble.

    As far as I know, XML::Parser only suffers from the limitations of the underlying XML library Expat.

    Liz

      Actually the XML shown above contains mixed content (the def element in the middle of the text of the section element), so YAML would not cut it here. YAML is designed for serialisation of Perl/Python/Ruby/whatever data structures, it is specifically NOT designed to be equivalent to XML.

      BTW, the one-liner to turn (appropriate) XML into YAML is:

      perl -MXML::Simple -MYAML -e'print Dump( XMLin( "myfile.xml"))'

      (from Stop Using XML Everywhere! Damn It!, that should convince you that I am not an XML fanatic ;--)

        Actually, from an information organization point of view, I was wondering why the def element was at that location. If you needed to generate a list of definitions out of that XML, you would need an XPath expression like //def, which can be very bad performance-wise.

        Liz
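For reference, a sketch of what the //def query mentioned above looks like with XML::LibXML. The helper name list_definitions and the second definition ("otherword") are made up for the example. It says nothing about speed on large files, which would have to be measured.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return [word, definition] pairs for every <def>,
# wherever it sits in the section hierarchy.
sub list_definitions {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    return map { [ $_->getAttribute('word'), $_->textContent ] }
           $doc->findnodes('//def');
}

my $notes = <<'XML';
<section id="chap1">
  <section id="part1">infoinfostuffhere
    <def word="someword">some definition of the word</def>
    more info and notes
  </section>
  <section id="part2">
    <def word="otherword">a second, made-up definition</def>
  </section>
</section>
XML

printf "%s: %s\n", @$_ for list_definitions($notes);
```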

Re: Picking an XML Module
by vek (Prior) on Aug 03, 2003 at 18:06 UTC

    Just out of curiosity, what limitations with XML::Parser are you referring to?

    In the past, you'd probably hear a lot of people trying to steer you away from XML::Parser in favor of an 'actively maintained' module. Well, matts has picked up the XML::Parser reins and released 2.32 and 2.33 just last week, in fact. So you can now add XML::Parser back onto the 'actively maintained XML parsing modules' list :-)

    In the past I actively used XML::Parser until the requirements for my project changed. I needed to be able to validate the XML against a DTD so I switched to XML::LibXML. I would have probably stayed with XML::Parser otherwise.

    -- vek --
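As an illustration of the DTD validation mentioned above, a minimal sketch with XML::LibXML. The DTD itself is an assumption written to fit the original poster's sample (mixed content in section, a required word attribute on def), and notes_are_valid is a hypothetical helper name.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed DTD for the note format; the rules are illustrative only.
my $dtd = XML::LibXML::Dtd->parse_string(<<'DTD');
<!ELEMENT section (#PCDATA | section | def)*>
<!ATTLIST section id CDATA #REQUIRED>
<!ELEMENT def (#PCDATA)>
<!ATTLIST def word CDATA #REQUIRED>
DTD

# Hypothetical helper: true if the document is well-formed and matches the DTD.
sub notes_are_valid {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    return $doc->is_valid($dtd) ? 1 : 0;
}

my $good = '<section id="chap1">notes <def word="someword">a definition</def> more</section>';
my $bad  = '<section id="chap1">notes <def>definition without a word</def></section>';

print "good: ", notes_are_valid($good), "\n";
print "bad:  ", notes_are_valid($bad),  "\n";
```

Calling $doc->validate($dtd) instead of is_valid raises an exception carrying the validation error, which is handier when you want to know what went wrong.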

      The main limitation of XML::Parser is that it is a low-level module: you have to do a lot of work yourself. The best example is probably that you have to buffer the data returned by the character handler, or it will come in several chunks. In general SAX-level handlers are quite a pain to write. And XML::Parser is not even SAX, so you don't get to benefit from the work that is being done at the moment on SAX modules (XML::SAX::Machines or XML::Filter::Dispatcher for example have some very good ideas). OTOH I must say that, antiquated as it is, XML::Parser's interface is a bit more convenient than pure SAX.
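The buffering chore described above can be sketched as follows. This is a minimal illustration, not a recommended design: def_texts is a hypothetical helper, and the "ampersand" sample is made up because an entity in the text is a typical case where the Char handler is called more than once for a single element.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: collect the full text of every <def> element.
sub def_texts {
    my ($xml) = @_;
    my @defs;
    my $buffer = '';
    my $in_def = 0;

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $tag, %attr) = @_;
                if ($tag eq 'def') { $in_def = 1; $buffer = ''; }
            },
            # May be called several times per element, so append
            # rather than assign.
            Char => sub {
                my ($expat, $text) = @_;
                $buffer .= $text if $in_def;
            },
            # Only at the end tag is the text known to be complete.
            End => sub {
                my ($expat, $tag) = @_;
                if ($tag eq 'def') { push @defs, $buffer; $in_def = 0; }
            },
        },
    );

    $parser->parse($xml);
    return @defs;
}

my $notes = '<section id="part1">info <def word="ampersand">the &amp; sign</def> more</section>';
print "$_\n" for def_texts($notes);
```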

      But if I compare this to the simplicity of... XML::Simple (which would not work in this case, it does not deal well with mixed content), or to the power of XML::LibXML's XPath engine, I don't think that XML::Parser is a good choice today.

      There are also some problems with the way XML::Parser deals with entities (especially in attribute values) that can be annoying if your XML uses them.

      I'm only maintaining XML::Parser so that it can ultimately be deprecated (and so that what bugs there are get fixed). Next release will have a LARGE warning in the documentation about how you shouldn't use this module.