in reply to Re: processing massive XML files with XML::Twig
in thread processing massive XML files with XML::Twig

Did you try it? I mean, did you actually compare the performance of XML::Twig and XML::SAX? Because I did, in a simple benchmark. Look at the last table.

SAX is convenient in that, with modules like XML::SAX::Machines, it lets you create pipelines of SAX filters, plug in dumps... But it is IMHO a pain to use, and it is also demonstrably slow, at least in Perl.

Sorry, you hit one of my pet peeves ;--)

If you want better performance than XML::Twig, you can use XML::LibXML. The API is different (pure DOM plus XPath, with fewer convenience methods than XML::Twig), and it is harder to process big files with it, but XML::LibXML uses less memory than XML::Twig, so you are more likely to be able to load the entire document in memory.
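For illustration only (a hedged sketch, not from the original post; the file name 'big.xml' and the <record>/<title> elements are invented), the DOM-plus-XPath style that XML::LibXML encourages looks roughly like this:

    use strict;
    use warnings;
    use XML::LibXML;

    # Parse the whole document into memory (a libxml2 DOM, which is more
    # compact than XML::Twig's Perl data structures).
    my $parser = XML::LibXML->new;
    my $doc    = $parser->parse_file('big.xml');

    # XPath does most of the navigation work.
    for my $record ($doc->findnodes('//record')) {
        print $record->findvalue('./title'), "\n";
    }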


see XML::LibXML::Reader
by myuserid7 (Scribe) on Dec 06, 2008 at 13:56 UTC
    Actually, XML::LibXML now has a pull parser (XML::LibXML::Reader) that doesn't read the entire DOM into memory. It is much faster than XML::Twig; I've used it successfully.
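    As an illustration only (a minimal sketch, not the poster's actual code; the file name 'big.xml' and the <record>/<title> elements are made up), a pull loop with XML::LibXML::Reader could look like this:

        use strict;
        use warnings;
        use XML::LibXML::Reader;

        my $reader = XML::LibXML::Reader->new( location => 'big.xml' )
            or die "cannot read big.xml\n";

        # nextElement jumps to the next <record> without building the whole
        # DOM; only the current record is copied into a regular DOM element.
        while ( $reader->nextElement('record') ) {
            my $record = $reader->copyCurrentNode(1);    # 1 = deep copy
            print $record->findvalue('./title'), "\n";
        }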
      much faster than XML::Twig
      code sample please

      Interesting. I have to see whether I could use this as an alternate parser for XML::Twig, or create a different module altogether, one that combines the speed of libxml2 with the convenience (IMHO ;--) of XML::Twig.

      It would be great if you (or someone else!) could provide code examples for the "Ways to Rome" series.

        Your benchmarking methodology is much less accurate than it could be. You shouldn't be measuring the time it takes to fork a new process and load the modules, and you should be measuring multiple runs and averaging the results, e.g. with timethese(-5, ...).
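        For illustration only (a hedged sketch, not the poster's code; 'sample.xml' is a made-up test file), a fairer comparison with the Benchmark module could look like this:

            use strict;
            use warnings;
            use Benchmark qw(timethese);
            use XML::Twig;
            use XML::LibXML;

            my $file = 'sample.xml';    # hypothetical test document

            # Modules are loaded once, outside the timed code; the negative
            # count makes each sub run for at least 5 CPU seconds.
            timethese( -5, {
                twig   => sub {
                    my $t = XML::Twig->new;
                    $t->parsefile($file);
                    $t->dispose;    # free the tree (XML::Twig uses circular refs)
                },
                libxml => sub {
                    XML::LibXML->new->parse_file($file);
                },
            });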