XML::SAX::ParserFactory policy and differences between parser implementations

seki has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am trying to parse some big xml files while not eating all the user memory, so XML::SAX::Parser seems to be the solution.

My files may contain various diacritics, so I need to preserve the files' UTF-8 encoding. However, by default XML::SAX::ParserFactory (code taken from the XML::SAX::Parser examples) gives me a parser that does not report the encoding from the document declaration.

I then discovered that there is more than one SAX parser on my system with

use feature 'say';
use XML::SAX;
use Data::Printer;
# debug: list known parsers
my $parsers = XML::SAX->parsers();
say np $parsers;

By accident, while testing all of them, I found that only XML::LibXML::SAX::Parser seems to be able to report the document encoding.

I wonder why not all parser implementations are able to report all of the document properties, and how I am supposed to learn the differences other than by trial and error...
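The trial and error can at least be scripted. The following is a minimal sketch, assuming XML::SAX is installed and that XML::SAX->parsers() lists the parsers registered on the system. EncodingProbe and probe_parsers are hypothetical names invented for this illustration; the sketch only records which event, if any, carried an Encoding key for each parser.

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;

# Hypothetical probe handler: remembers which SAX event carried an Encoding key.
{
    package EncodingProbe;
    use base 'XML::SAX::Base';

    sub xml_decl {
        my ($self, $decl) = @_;
        $self->{seen}{xml_decl} = $decl->{Encoding}
            if defined $decl->{Encoding};
    }

    sub start_document {
        my ($self, $doc) = @_;
        $self->{seen}{start_document} = $doc->{Encoding}
            if defined $doc->{Encoding};
    }
}

# Hypothetical helper: parse the same string with every registered parser
# and return { parser name => { event name => encoding } }.
sub probe_parsers {
    my ($xml) = @_;
    my %report;
    for my $info (@{ XML::SAX->parsers }) {
        my $name  = $info->{Name};
        my $probe = EncodingProbe->new;
        my $ok    = eval {
            local $XML::SAX::ParserPackage = $name;
            XML::SAX::ParserFactory->parser(Handler => $probe)
                                   ->parse_string($xml);
            1;
        };
        $report{$name} = $ok
            ? { %{ $probe->{seen} || {} } }
            : { error => "$@" };    # parser registered but not usable
    }
    return \%report;
}

my $report = probe_parsers('<?xml version="1.0" encoding="UTF-8"?><root/>');
for my $name (sort keys %$report) {
    my $seen  = $report->{$name};
    my @where = map { "$_ => $seen->{$_}" } sort keys %$seen;
    print "$name: ", (@where ? join(', ', @where) : 'no Encoding reported'), "\n";
}
```

What the report contains depends entirely on which parsers and versions are installed, so the output is not shown here.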

Also, why do I need to use XML::LibXML::SAX::Parser explicitly, when the documentation of XML::LibXML only mentions XML::LibXML::SAX, which misses the Encoding attribute of the XML declaration?

The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

Re: XML::SAX::ParserFactory policy and differences between parser implementations by beech (Parson) on Mar 01, 2016 at 19:43 UTC
"I am trying to parse some big xml files while not eating all the user memory, so XML::SAX::Parser seems to be the solution."

The solution is called XML::Twig, see http://xmltwig.org/tutorial/

Update: It seems you're already aware of Twig. Anyway, the docs aren't clear about what is supposed to be going on, but the information is out there :) Use the xml_decl handler.

Read more... (2 kB)
Re^2: XML::SAX::ParserFactory policy and differences between parser implementations by seki (Monk) on Mar 02, 2016 at 01:54 UTC
Yes, I am aware of XML::Twig, but it is not suitable for my needs (or at least I did not see how I could use it), because I need to "patch" an already parsed element to adjust its value during the parsing and splitting of a big block of elements that I prefer not to keep in memory.

As you mention yourself in your results, the different SAX parsers are not consistent with regard to the SAX events: XML::SAX::Expat includes the encoding in the start_document() data instead of the xml_decl() data, and XML::SAX::PurePerl does not call xml_decl() at all.

Also I do not get the same results as you with my test program and data. Could you check for what file XML::LibXML::SAX manages to give you an encoding? You can see that it does not with my UTF-8 sample.

data.xml

<?xml version="1.0" encoding="UTF-8" ?>
<root>
  <foo>
    <bar attr="baz">héhé mes 2 €</bar>
    <baz other="dummy"/>
  </foo>
</root>

test_sax.pl

use strict;
use warnings;
use feature 'say';
#~ use Say; #portability trick for 5.8.8
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

my $input = $ARGV[0] or die "usage: $0 <file.xml> [parser_package]\n";
# optional second argument forces a specific parser implementation
$XML::SAX::ParserPackage = $ARGV[1] if $ARGV[1];

my $output; #just for not outputting to STDOUT
my $writer  = XML::SAX::Writer->new( Output => \$output );
my $handler = SaxHandler->new( Handler => $writer );
my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );

say sprintf "parser is %s (%s)", ref $parser, $parser->VERSION;
$parser->parse_file($input);

{
    package SaxHandler;
    use base 'XML::SAX::Base';
    use Data::Printer {indent=>2};
    use feature 'say';
    #~ use Say; #portability trick for 5.8.8

    # dump the data of the XML declaration event, then pass it on
    sub xml_decl {
        my ($self, $decl) = @_;
        say "decl ", np $decl;
        $self->SUPER::xml_decl($decl);
    }

    # dump the data of the start_document event, then pass it on
    sub start_document {
        my ($self, $doc) = @_;
        say "document ", np $doc;
        $self->SUPER::start_document($doc);
    }

    sub start_element {
        my ($self, $el) = @_;
        #~ say "start element " . $el->{LocalName};
        $self->SUPER::start_element($el);
    }
}

My results:

macbookseb:perl seb$ perl -v
This is perl 5, version 22, subversion 1 (v5.22.1) built for darwin-thread-multi-2level[...]

macbookseb:perl seb$ perl test_sax.pl data.xml XML::SAX::PurePerl
parser is XML::SAX::PurePerl (0.99)
document \ {}

macbookseb:perl seb$ perl test_sax.pl data.xml XML::SAX::Expat
parser is XML::SAX::Expat (0.51)
document \ { Encoding "UTF-8", Standalone "", Version 1.0 }

macbookseb:perl seb$ perl test_sax.pl data.xml XML::LibXML::SAX
parser is XML::LibXML::SAX (2.0124)
document \ {}
decl \ { Version 1.0 }

macbookseb:perl seb$ perl test_sax.pl data.xml XML::LibXML::SAX::Parser
parser is XML::LibXML::SAX::Parser (2.0124)
document \ {}
decl \ { Encoding "UTF-8", Version 1.0 }
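Since the encoding arrives in start_document() for one parser and in xml_decl() for another, a handler can simply accept it from whichever event delivers it. Below is a minimal sketch of that idea. EncodingCollector is a hypothetical name, and it is written as a plain class fed with the event data copied from the results above, so that it runs without any SAX module installed. A real handler would presumably inherit from XML::SAX::Base and forward the events with SUPER, as in the test program.

```perl
use strict;
use warnings;

{
    package EncodingCollector;    # hypothetical name, for illustration only

    sub new {
        my ($class) = @_;
        return bless { encoding => undef }, $class;
    }

    # Keep the first non-empty Encoding value, whichever event delivers it.
    sub _note {
        my ($self, $data) = @_;
        return unless ref($data) eq 'HASH';
        my $enc = $data->{Encoding};
        $self->{encoding} = $enc
            if !defined $self->{encoding} && defined $enc && length $enc;
        return;
    }

    sub start_document { my ($self, $doc)  = @_; $self->_note($doc)  }
    sub xml_decl       { my ($self, $decl) = @_; $self->_note($decl) }
    sub encoding       { my ($self) = @_; return $self->{encoding} }
}

# Event data copied from the results reported above, one list per parser.
my %events = (
    'XML::SAX::PurePerl' => [ [ start_document => {} ] ],
    'XML::SAX::Expat'    => [
        [ start_document => { Encoding => 'UTF-8', Standalone => '', Version => '1.0' } ],
    ],
    'XML::LibXML::SAX' => [
        [ start_document => {} ],
        [ xml_decl       => { Version => '1.0' } ],
    ],
    'XML::LibXML::SAX::Parser' => [
        [ start_document => {} ],
        [ xml_decl       => { Encoding => 'UTF-8', Version => '1.0' } ],
    ],
);

my %found;
for my $parser (sort keys %events) {
    my $collector = EncodingCollector->new;
    for my $event (@{ $events{$parser} }) {
        my ($method, $data) = @$event;
        $collector->$method($data);
    }
    $found{$parser} = $collector->encoding;
    printf "%-26s %s\n", $parser,
        defined $found{$parser} ? $found{$parser} : '(not reported)';
}
```

This does not help with parsers that never report the encoding at all (XML::SAX::PurePerl and XML::LibXML::SAX in the results above); it only hides the difference between the two that do.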
Re^3: XML::SAX::ParserFactory policy and differences between parser implementations by beech (Parson) on Mar 02, 2016 at 02:38 UTC
"Also I do not get the same results as you with my test program and data."

What do you get with my program?

Update:

"Could you check for what file XML::LibXML::SAX manages to give you an encoding?"

When it's not utf-8: when it's `encoding="ISO-8859-1"`.
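When comparing parsers on several files, it can help to check what each file actually declares, independently of any SAX parser. The sketch below uses only core Perl and reads the declaration from the first bytes of the file. encoding_from_head and declared_encoding are hypothetical helper names, and the approach assumes an ASCII-compatible encoding (it will not work for UTF-16 files) and a declaration that fits in the first 200 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the encoding pseudo-attribute from the
# leading bytes of a document; returns undef when none is declared.
sub encoding_from_head {
    my ($head) = @_;
    return unless defined $head;
    # Optional UTF-8 byte order mark, then the XML declaration.
    if ($head =~ /\A(?:\xEF\xBB\xBF)?<\?xml\s[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/) {
        return $2;
    }
    return;
}

# Hypothetical helper: same check for a file on disk.
sub declared_encoding {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "cannot open $file: $!\n";
    my $head = '';
    read $fh, $head, 200;    # assumption: the declaration fits in 200 bytes
    close $fh;
    return encoding_from_head($head);
}

# Usage: perl declared_encoding.pl data.xml other.xml
for my $file (@ARGV) {
    my $enc = declared_encoding($file);
    printf "%s: %s\n", $file, defined $enc ? $enc : '(no encoding declared)';
}
```

This only reports what the declaration says; it does not verify that the file content really is in that encoding.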
Re^4: XML::SAX::ParserFactory policy and differences between parser implementations by seki (Monk) on Mar 02, 2016 at 02:57 UTC
Re: XML::SAX::ParserFactory policy and differences between parser implementations by choroba (Cardinal) on Mar 03, 2016 at 19:13 UTC
Note that there's also XML::LibXML::Reader. It's a pull parser, similar to SAX but different: instead of you giving it callbacks, it tells you what state it's in, and you can react accordingly.

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
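To make the pull style concrete, here is a minimal sketch, assuming XML::LibXML is installed. The sample document is an ASCII-only variant of the data.xml posted earlier in the thread. The loop asks the reader for the next node and reacts only to element nodes; it also stores whatever the reader's encoding method returns, which is worth checking against your own files rather than taking for granted.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# ASCII-only variant of the sample document from the thread.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <foo>
    <bar attr="baz">2 EUR</bar>
    <baz other="dummy"/>
  </foo>
</root>
XML

my $reader = XML::LibXML::Reader->new(string => $xml);

my @elements;    # element names in document order
my $encoding;    # whatever the reader reports, possibly undef

# Pull loop: we ask for the next node instead of receiving callbacks.
while ($reader->read) {
    $encoding = $reader->encoding unless defined $encoding;
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    push @elements, $reader->name;
    printf "%s%s\n", '  ' x $reader->depth, $reader->name;
}

print "encoding reported by the reader: ",
    defined $encoding ? $encoding : '(none)', "\n";
```

Because the program drives the loop, only the current node needs to be held at any time, which matches the goal of not keeping a big block of elements in memory.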