in reply to record-by-record XML parsing

How big will the document be?
Do you need XPath capabilities?

I like XML::Twig.
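For record-by-record parsing of a large file, the appeal is that you handle one record at a time and then purge it, so memory stays flat. A minimal sketch, assuming the records live in <record> elements with a <name> child and the file is called big.xml:

    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            record => sub {
                my ( $t, $record ) = @_;
                # process one record, e.g. grab a child's text
                print $record->first_child_text('name'), "\n";
                $t->purge;    # release everything parsed so far
            },
        },
    );
    $twig->parsefile('big.xml');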

Re^2: record-by-record XML parsing
by tbusch (Sexton) on Sep 09, 2008 at 16:59 UTC
    Hi, the document could be as big as 1 GB. The fastest module I know of is XML::LibXML, but even that doesn't seem quick enough. Thomas.
      Honestly, I've only worked with XML::Twig, so I cannot tell you how it compares to other modules. However, I recommend looking at this FAQ entry.
      That said, if you need to process a document that is too big to fit in memory and XML::Twig is too slow for you, my reluctant advice would be to use "bare" XML::Parser. It won't be as easy to use as XML::Twig: basically with XML::Twig you trade some speed (depending on what you do, from a factor of 3 to... none) for ease of use, but it will be easier IMHO than using SAX (albeit not standard), and at this point a LOT faster (see the last test in simple benchmark).
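      For reference, a bare XML::Parser skeleton looks roughly like this; you accumulate state yourself in the three handlers (the <record>/<name> element names and the file name are assumptions):

          use XML::Parser;

          my %record;
          my $field;

          my $parser = XML::Parser->new(
              Handlers => {
                  Start => sub { my ( $p, $elt ) = @_; $field = $elt; },
                  Char  => sub { my ( $p, $text ) = @_; $record{$field} .= $text if $field; },
                  End   => sub {
                      my ( $p, $elt ) = @_;
                      if ( $elt eq 'record' ) {
                          # a full record has been seen: process it, then reset
                          print $record{name}, "\n" if exists $record{name};
                          %record = ();
                      }
                      undef $field;
                  },
              },
          );
          $parser->parsefile('big.xml');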

      1 GB is going to be slow no matter what. If you have enough memory to load the document entirely into memory, then XML::LibXML will be the fastest option. If you can't load it and you need to use SAX, then performance drops dramatically (as mentioned previously, look at this benchmark). I am not sure the pull interface is available from the Perl module, but that might give you XML::LibXML speed without having to load the whole document (I have never used it myself, so I am not sure that's even possible). XML::Twig would be a better option than using SAX. And indeed, XML::Parser might be your best bet, even if it's not the easiest module to use.
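      If your XML::LibXML is recent enough to include the pull parser (it is exposed as XML::LibXML::Reader), the loop would look something like this; the <record> element and file name are assumptions, and you should check the module is actually installed before relying on it:

          use XML::LibXML::Reader;

          my $reader = XML::LibXML::Reader->new( location => 'big.xml' )
              or die "cannot read big.xml\n";

          # jump from <record> to <record>, expanding only one at a time
          while ( $reader->nextElement('record') ) {
              my $record = $reader->copyCurrentNode(1);    # small DOM fragment
              print $record->findvalue('name'), "\n";
          }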

      Beyond that, you get into dangerous territory: format the XML predictably, get rid of potential comments, CDATA sections, PIs and entities, and use regexps. I would try the other options first, though.
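      To make the danger concrete: the regexp approach usually means setting the input record separator to the record's closing tag and matching fields out of each chunk. A deliberately fragile sketch, assuming well-formatted input with one <record> per chunk and no comments, CDATA sections, or entities inside the data:

          open my $fh, '<', 'big.xml' or die "cannot open big.xml: $!";
          local $/ = '</record>';    # read one record-sized chunk at a time

          while ( my $chunk = <$fh> ) {
              next unless $chunk =~ /<record\b/;    # skip the prologue
              my ($name) = $chunk =~ m{<name>([^<]*)</name>};
              print "$name\n" if defined $name;
          }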