in reply to Apache +XML parsing

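You should be OK with most XML modules, but if you really need to conserve memory then I recommend looking into event-based parsers like SAX rather than tree-based parsers. A tree-based parser builds the whole document in memory before you can touch it, while an event-based parser hands you one piece at a time and keeps nothing you do not store yourself. A quick Google search yielded this document, which does a pretty good job of explaining the difference: http://www.informit.com/articles/article.aspx?p=27006&seqNum=7&rll=1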
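To make the event-based idea concrete, here is a minimal sketch using the handler interface of XML::Parser. It is event based but is not the SAX API itself, and the sample document and its element names are made up for illustration. Only the titles are kept in memory; everything else is discarded as it streams past.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document; the element names are invented for this sketch.
my $xml = <<'XML';
<articles>
  <article><title>First</title></article>
  <article><title>Second</title></article>
</articles>
XML

my @titles;          # the only data we keep
my $in_title = 0;    # true while the parser is inside a <title> element
my $buffer   = '';   # text collected for the current <title>

my $parser = XML::Parser->new(
    Handlers => {
        # Called for every opening tag.
        Start => sub {
            my ($expat, $element) = @_;
            if ($element eq 'title') {
                $in_title = 1;
                $buffer   = '';
            }
        },
        # Called for character data; the text may arrive in several chunks.
        Char => sub {
            my ($expat, $text) = @_;
            $buffer .= $text if $in_title;
        },
        # Called for every closing tag.
        End => sub {
            my ($expat, $element) = @_;
            if ($element eq 'title') {
                push @titles, $buffer;
                $in_title = 0;
            }
        },
    },
);

$parser->parse($xml);

print "$_\n" for @titles;
```

The state variables ($in_title, $buffer) are the price of this style: the parser remembers nothing for you, which is exactly why it stays small on large documents.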

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Re^2: Apache +XML parsing
by Jenda (Abbot) on Nov 08, 2008 at 23:17 UTC
      Thanks Brothers.
      Jenda, your XML::Rules module looks interesting, and I'd like to give it a try.
      What I need to do is fairly simple and boring: I need to parse a multi-level XML document, contained in the scalar $xmldoc and representing a Journal Article (*), into a simple hash like
      my $href = {
          'TI'  => [ 'content of <PubArticle><Article><Title> tag' ],
          'AU'  => [
              'content of <PubArticle><Article><Authors><Author name="author1"> tag',
              'content of <PubArticle><Article><Authors><Author name="author2"> tag',
              # etc..
          ],
          'REF' => [
              # and so on...
          ],
      };

      So the end result I want is a hash in which each key points to an arrayref, the array containing one or more strings picked up from tag attributes and/or values in the original XML document. I admit I am a bit lost after a first read of the online doc. What I don't see very clearly, from the first example at the head of the doc, is how I get the result into my $href hash.
      (*) For a full example of the source XML, see this link: http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=18632282

        You can use XML::Rules->inferRulesFromExample() (or XML::Rules->inferRulesFromDTD() if you have the DTD) to get a basic set of rules for the document, and then use Data::Dumper to see what data structure those rules create. Then you can start tweaking the rules to create a nicer structure. For example, if you do not need the author names split into parts and only want the valid ones, you can delete 'Author' from the list of tags with the 'as array' built-in rule and add a rule like this:

        'Author' => sub { return unless $_[1]->{ValidYN} eq 'Y'; return "$_[1]->{ForeName} $_[1]->{LastName}"; },
        and see what structure you get.

        And then continue tweaking the rules to filter the stuff you are not interested in, format stuff the way you want, rename hash keys etc.
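As for getting the data into $href: one simple way, sketched below, is to let the rules push into the hash themselves and return nothing, instead of shaping the structure that the parser returns. This is only an illustration under assumptions: the sample document is a cut-down stand-in that follows the tag and attribute names used in the question, not the real efetch output, and the choice to collect through a closure is mine rather than the module's recommended pattern.

```perl
use strict;
use warnings;
use XML::Rules;

# Simplified stand-in for the article record. Tag and attribute names follow
# the question above; the real PubMed XML is structured differently.
my $xmldoc = <<'XML';
<PubArticle>
  <Article>
    <Title>A sample title</Title>
    <Authors>
      <Author name="author1"/>
      <Author name="author2"/>
    </Authors>
  </Article>
</PubArticle>
XML

my $href = {};   # filled in by the rules below

my $parser = XML::Rules->new(
    rules => [
        # A rule runs when its closing tag is seen. $_[1] is a hash holding
        # the tag's attributes, with any text content under the _content key.
        Title  => sub { push @{ $href->{TI} }, $_[1]{_content}; return },
        Author => sub { push @{ $href->{AU} }, $_[1]{name};     return },

        # The container tags carry nothing we want, so their rules
        # simply return nothing.
        Authors    => sub { return },
        Article    => sub { return },
        PubArticle => sub { return },
    ],
);

$parser->parse($xmldoc);

print "TI: @{ $href->{TI} }\n";
print "AU: @{ $href->{AU} }\n";
```

Because every rule returns an empty list, nothing accumulates inside the parser; the only data kept is what the rules pushed into $href.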

      XML::Rules is SAX (maybe both), and XML::Twig is definitely both.

        Nope. XML::Rules is a stream parser with a twist, but it is in no way related to the SAX (Simple API for XML) standard, and XML::Twig is a tree-based parser with a twist, with no relation to the DOM standard. Both sit on top of XML::Parser (actually XML::Parser::Expat in the case of XML::Rules, but that's part of the same package). There are several Perl implementations of both SAX and DOM, and several modules that have their own, different (usually more perlish) APIs.