Apache +XML parsing

soliplaya has asked for the wisdom of the Perl Monks concerning the following question:

Eminences, I beg audience related to the following theme:
I am writing a (Perl) cgi-bin script to run under mod_perl, and thus persistently; the script will be called repeatedly to parse XML data. Each time, the data consists of a published document description, estimated at between 2 and 10 KB.
After parsing, I need to be able to extract most tags and attribute values, to pass them to some other software which does not understand XML.
Coming from a reputed source, I do not expect many issues with the XML per se.
But having had some problems before with memory leaks and/or performance in some Perl XML modules, I am asking for your benign recommendations as to what works reasonably fast, repeatedly and safely under mod_perl, without swelling the server's memory footprint too much.

Thank you in advance.

Re: Apache +XML parsing
by jeffa (Bishop) on Nov 08, 2008 at 17:07 UTC

You should be OK with most XML modules, but if you really need to conserve memory then I recommend looking into event-based parsers like SAX rather than tree-based parsers. A quick Google search yielded this document, which does a pretty good job of explaining the difference: http://www.informit.com/articles/article.aspx?p=27006&seqNum=7&rll=1
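
To make the event-based idea concrete, here is a minimal sketch using XML::Parser's handler interface (expat-style callbacks rather than strict SAX). The sample document and its element names are made up for illustration; the point is only that each tag, attribute and text run is handed to a callback as it is seen, and no tree of the whole document is kept in memory.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document; element and attribute names are illustrative only.
my $xml = <<'XML';
<doc id="42">
  <title>Sample title</title>
  <author name="author1">First Author</author>
</doc>
XML

my @events;      # flat record of what the parser reported, in document order
my $text = '';   # text collected since the most recent start tag

my $parser = XML::Parser->new(
    Handlers => {
        # Called for every opening tag, with its attributes as a flat list.
        Start => sub {
            my ($expat, $element, %attrs) = @_;
            $text = '';
            push @events, "start $element";
            push @events, "attr $_=$attrs{$_}" for sort keys %attrs;
        },
        # May be called several times for one text run, so append.
        Char => sub {
            my ($expat, $string) = @_;
            $text .= $string;
        },
        # Called for every closing tag; report non-blank text, then reset.
        End => sub {
            my ($expat, $element) = @_;
            (my $trimmed = $text) =~ s/^\s+|\s+$//g;
            push @events, "text $element=$trimmed" if length $trimmed;
            $text = '';
        },
    },
);

$parser->parse($xml);
print "$_\n" for @events;
```

Because only the current text run and whatever you choose to keep are held, memory use depends on what the handlers store, not on the size of the document.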

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)


Re^2: Apache +XML parsing

by Jenda (Abbot) on Nov 08, 2008 at 23:17 UTC

That article is six years old, and it is written as if DOM and SAX were the only contenders. soliplaya, have a look at XML::Twig and XML::Rules. The first is often recommended here; the second is mine :-) and a nice combination of event-based and tree-based parsing.

Jenda
Support Denmark!
Defend the free world!


Re^3: Apache +XML parsing

by soliplaya (Beadle) on Nov 09, 2008 at 14:00 UTC

my $href = {
  'TI' => [ 'content of <PubArticle><Article><Title> tag' ],
  'AU' => [ 'content of <PubArticle><Article><Authors><Author name="author1" tag',
            'content of <PubArticle><Article><Authors><Author name="author2" tag',
            # etc.
          ],
  'REF' => [
            # and so on...
          ],
};
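
One way to arrive at a structure like the one above is sketched below with XML::Twig, which was suggested earlier in the thread. This is an illustration under assumptions, not a tested mod_perl handler: the sample XML, its values and the `parse_article` helper are hypothetical, only the 'TI' and 'AU' keys are filled, and the element paths are taken from the descriptions in the hash.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample input; element names follow the paths quoted above,
# the values are made up for illustration.
my $xml = <<'XML';
<PubArticle>
  <Article>
    <Title>Sample title</Title>
    <Authors>
      <Author name="author1">First Author</Author>
      <Author name="author2">Second Author</Author>
    </Authors>
  </Article>
</PubArticle>
XML

# parse_article is a hypothetical helper name, not part of XML::Twig.
sub parse_article {
    my ($xml_string) = @_;
    my %fields;   # fresh hash per call, so nothing accumulates between requests

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Each handler fires once the matching element is completely parsed.
            'Article/Title'  => sub {
                my ($t, $elt) = @_;
                push @{ $fields{TI} }, $elt->text;
            },
            'Authors/Author' => sub {
                my ($t, $elt) = @_;
                push @{ $fields{AU} }, $elt->text;
            },
        },
    );
    $twig->parse($xml_string);
    $twig->dispose;   # release the tree explicitly before returning
    return \%fields;
}

my $href = parse_article($xml);
print "TI: $_\n" for @{ $href->{TI} };
print "AU: $_\n" for @{ $href->{AU} };
```

Creating the twig inside the function and letting it go out of scope on every call keeps per-request state from lingering in a persistent interpreter; whether that is enough to keep the footprint flat would still need to be measured under the real server.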

Re^4: Apache +XML parsing

by Jenda (Abbot) on Nov 09, 2008 at 17:55 UTC

Re^5: Apache +XML parsing

by soliplaya (Beadle) on Nov 18, 2008 at 10:53 UTC

Some notes below your chosen depth have not been shown here

Re^3: Apache +XML parsing

by Anonymous Monk on Nov 09, 2008 at 09:46 UTC

XML::Rules is SAX (maybe both), and XML::Twig is definitely both.


Re^4: Apache +XML parsing

by Jenda (Abbot) on Nov 09, 2008 at 17:40 UTC