seki has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am trying to parse some big xml files while not eating all the user memory, so XML::SAX::Parser seems to be the solution.

My files may contain various diacritics, so I need to preserve the files' UTF-8 encoding, but XML::SAX::ParserFactory (code taken from the XML::SAX::Parser examples) returns by default a parser that does not report the encoding from the document declaration.

I then discovered that there is more than one SAX parser on my system with

#debug : list known parsers
my $parsers = XML::SAX->parsers();
say np $parsers;
and, by accident while testing all of them, found that only XML::LibXML::SAX::Parser seems to be able to report the document encoding.

I wonder why not all parser implementations are able to report all the document properties, and how I am supposed to learn the differences other than by trial and error...

Also, why do I need to explicitly use XML::LibXML::SAX::Parser when the documentation of XML::LibXML only mentions XML::LibXML::SAX, which misses the Encoding attribute of the XML declaration?

The best programs are the ones written when the programmer is supposed to be working on something else. - Melinda Varian

Replies are listed 'Best First'.
Re: XML::SAX::ParserFactory policy and differences between parser implementations
by beech (Parson) on Mar 01, 2016 at 19:43 UTC

    I am trying to parse some big xml files while not eating all the user memory, so XML::SAX::Parser seems to be the solution.

    The solution is called XML::Twig, see http://xmltwig.org/tutorial/

    update: It seems you're already aware of twig,

    Anyway, the docs aren't clear about what is supposed to be going on, but the information is out there :) Use the xml_decl handler.

      Yes, I am aware of XML::Twig, but it is not suitable for my needs (or at least I did not see how I could use it, because I need to "patch" an already parsed element to adjust its value during the parsing and splitting of a big block of elements that I prefer not to keep in memory).

      As you mention yourself in your results, the different SAX parsers are not consistent with regard to the SAX events: XML::SAX::Expat includes the encoding in the start_document() data instead of the xml_decl() data, and XML::SAX::PurePerl does not call xml_decl() at all.

      Also I do not get the same results as you with my test program and data. Could you check for what file XML::LibXML::SAX manages to give you an encoding? You can see it does not with my utf-8 sample.

      data.xml

      <?xml version="1.0" encoding="UTF-8" ?>
      <root>
        <foo>
          <bar attr="baz">héhé mes 2 €</bar>
          <baz other="dummy"/>
        </foo>
      </root>

      test_sax.pl

      use strict;
      use warnings;
      use feature 'say';
      #~ use Say; #portability trick for 5.8.8
      use XML::SAX::ParserFactory;
      use XML::SAX::Writer;

      my $input = $ARGV[0] or die "usage: $0 <file.xml> [parser_package]";
      $XML::SAX::ParserPackage = $ARGV[1] if $ARGV[1];

      my $output; #just for not outputting to STDOUT
      my $writer  = XML::SAX::Writer->new( Output => \$output );
      my $handler = SaxHandler->new( Handler => $writer );
      my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
      say sprintf "parser is %s (%s)", ref $parser, $parser->VERSION;
      $parser->parse_file($input);

      {
          package SaxHandler;
          use base 'XML::SAX::Base';
          use Data::Printer {indent=>2};
          use feature 'say';
          #~ use Say; #portability trick for 5.8.8

          sub xml_decl {
              my ($self, $decl) = @_;
              say "decl ", np $decl;
              $self->SUPER::xml_decl($decl);
          }

          sub start_document {
              my ($self, $doc) = @_;
              say "document ", np $doc;
              $self->SUPER::start_document($doc);
          }

          sub start_element {
              my ($self, $el) = @_;
              #~ say "start element " . $el->{LocalName};
              $self->SUPER::start_element($el);
          }
      }

      my results:

      macbookseb:perl seb$ perl -v
      This is perl 5, version 22, subversion 1 (v5.22.1) built for darwin-thread-multi-2level[...]

      macbookseb:perl seb$ perl test_sax.pl data.xml XML::SAX::PurePerl
      parser is XML::SAX::PurePerl (0.99)
      document \ {}

      macbookseb:perl seb$ perl test_sax.pl data.xml XML::SAX::Expat
      parser is XML::SAX::Expat (0.51)
      document \ { Encoding "UTF-8", Standalone "", Version 1.0 }

      macbookseb:perl seb$ perl test_sax.pl data.xml XML::LibXML::SAX
      parser is XML::LibXML::SAX (2.0124)
      document \ {}
      decl \ { Version 1.0 }

      macbookseb:perl seb$ perl test_sax.pl data.xml XML::LibXML::SAX::Parser
      parser is XML::LibXML::SAX::Parser (2.0124)
      document \ {}
      decl \ { Encoding "UTF-8", Version 1.0 }
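
      Since these results show the encoding arriving in start_document() for one parser and in xml_decl() for another, one possible workaround is a handler that accepts it from either event. The following is only a minimal sketch: the package name EncodingSniffer, the declared_encoding() accessor and the _declared_encoding hash key are made up for illustration, and it assumes the event data is a hash with an Encoding key, as in the dumps shown here.

```perl
use strict;
use warnings;
use XML::SAX::ParserFactory;

{
    # EncodingSniffer is a hypothetical name for this sketch, not a CPAN module.
    package EncodingSniffer;
    use base 'XML::SAX::Base';

    # Remember the first declared encoding seen, whichever event carried it.
    sub _note_encoding {
        my ($self, $data) = @_;
        return if defined $self->{_declared_encoding};
        return unless ref($data) eq 'HASH' && defined $data->{Encoding};
        $self->{_declared_encoding} = $data->{Encoding};
    }

    # XML::SAX::Expat put the encoding here in the results shown.
    sub start_document {
        my ($self, $doc) = @_;
        $self->_note_encoding($doc);
        $self->SUPER::start_document($doc);
    }

    # XML::LibXML::SAX::Parser put the encoding here in the results shown.
    sub xml_decl {
        my ($self, $decl) = @_;
        $self->_note_encoding($decl);
        $self->SUPER::xml_decl($decl);
    }

    # Returns undef when the parser reported no encoding at all.
    sub declared_encoding { $_[0]{_declared_encoding} }
}

my $handler = EncodingSniffer->new;
my $parser  = XML::SAX::ParserFactory->parser( Handler => $handler );
$parser->parse_string('<?xml version="1.0" encoding="UTF-8"?><root/>');

my $enc = $handler->declared_encoding;
printf "%s reported encoding: %s\n",
    ref($parser), defined $enc ? $enc : '(none)';
```

      This does not help with a parser that reports the encoding in neither event (XML::SAX::PurePerl in the results shown), so the undef case still has to be handled by the caller.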

        Also I do not get the same results as you with my test program and data.

        What do you get with my program?

        update:

        Could you check for what file XML::LibXML::SAX manages to give you an encoding?

        When it's not UTF-8, for example when it's encoding="ISO-8859-1".

Re: XML::SAX::ParserFactory policy and differences between parser implementations
by choroba (Cardinal) on Mar 03, 2016 at 19:13 UTC
    Note that there's also XML::LibXML::Reader. It's a pull parser, similar to SAX, but different - instead of giving it callbacks, it tells you what state it's in, and you can react accordingly.
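
    To illustrate the pull style, here is a minimal sketch that walks a small in-memory document and collects the element names. The sample XML and the @seen array are assumptions made up for the example; the encoding() call is the accessor documented for XML::LibXML::Reader, and I have not checked what it returns for the OP's data.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;    # exports the XML_READER_TYPE_* constants

# Illustrative ASCII-only sample, loosely modelled on the data.xml in this thread.
my $xml = '<?xml version="1.0" encoding="UTF-8"?>'
        . '<root><foo><bar attr="baz">text</bar><baz other="dummy"/></foo></root>';

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot create reader\n";

my @seen;
# read() advances to the next node and returns 1 while there is one.
while ( $reader->read == 1 ) {
    # The caller asks the reader what it is looking at, instead of
    # registering callbacks as with SAX.
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    push @seen, $reader->name;
}

my $enc = $reader->encoding;
print "elements: @seen\n";
print "encoding: ", ( defined $enc ? $enc : '(not reported)' ), "\n";
```

    Only the current node is held at any time, so memory use should stay flat on large files as long as you do not copy whole subtrees out of the reader.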
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,