in reply to Re: XML::SAX::ParserFactory policy and differences between parser implementations

Yes, I am aware of XML::Twig, but it does not suit my needs (or at least I did not see how I could use it): I need to "patch" an already parsed element to adjust its value while parsing and splitting a big block of elements that I prefer not to keep in memory.

As you mention yourself in your results, the different SAX parsers are not consistent with regard to the SAX events they emit: XML::SAX::Expat puts the encoding into the start_document() data instead of the xml_decl() data, and XML::SAX::PurePerl does not call xml_decl() at all.
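One way to paper over that difference in a handler is to merge whatever the two events delivered. This is only a minimal sketch: `declared_info` is a hypothetical helper name, not part of XML::SAX, and it assumes the handler has stored the hashes passed to start_document() and xml_decl() (or undef when an event never fired).

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::SAX API): merge the data received by
# start_document() and xml_decl(), taking each key from the first event
# that actually supplied a defined value.
sub declared_info {
    my ($doc, $decl) = @_;
    my %info;
    for my $src ($doc, $decl) {
        next unless ref($src) eq 'HASH';    # event never fired, or no data
        for my $key (qw(Version Encoding Standalone)) {
            $info{$key} = $src->{$key}
                if !defined $info{$key} && defined $src->{$key};
        }
    }
    return \%info;
}

# Expat style: everything arrives with start_document(), no xml_decl()
my $expat = declared_info(
    { Version => '1.0', Encoding => 'UTF-8', Standalone => '' }, undef );

# XML::LibXML::SAX::Parser style: empty start_document(), data in xml_decl()
my $libxml = declared_info( {}, { Version => '1.0', Encoding => 'UTF-8' } );

printf "%s / %s\n", $expat->{Encoding}, $libxml->{Encoding};
```

It cannot help when a parser reports the encoding in neither event, which is what XML::SAX::PurePerl and XML::LibXML::SAX do with my sample.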

Also I do not get the same results as you with my test program and data. Could you check for what file XML::LibXML::SAX manages to give you an encoding? You can see it does not with my utf-8 sample.

data.xml

<?xml version="1.0" encoding="UTF-8" ?>
<root>
  <foo>
    <bar attr="baz">héhé mes 2 €</bar>
    <baz other="dummy"/>
  </foo>
</root>

test_sax.pl

use strict;
use warnings;
use feature 'say';
#~ use Say; #portability trick for 5.8.8
use XML::SAX::ParserFactory;
use XML::SAX::Writer;

my $input = $ARGV[0] or die "usage: $0 <file.xml> [parser_package]";
$XML::SAX::ParserPackage = $ARGV[1] if $ARGV[1];

my $output; #just for not outputting to STDOUT
my $writer = new XML::SAX::Writer(Output => \$output);
my $handler = new SaxHandler( Handler => $writer );
my $parser = XML::SAX::ParserFactory->parser( Handler => $handler );
say sprintf "parser is %s (%s)", ref $parser, $parser->VERSION ;
$parser->parse_file($input);

{
    package SaxHandler;
    use base 'XML::SAX::Base';
    use Data::Printer {indent=>2};
    use feature 'say';
    #~ use Say; #portability trick for 5.8.8

    sub xml_decl {
        my ($self, $decl) = @_;
        say "decl ", np $decl;
        $self->SUPER::xml_decl($decl);
    }

    sub start_document {
        my ($self, $doc) = @_;
        say "document ", np $doc;
        $self->SUPER::start_document($doc);
    }

    sub start_element {
        my ($self, $el) = @_;
        #~ say "start element " . $el->{LocalName};
        $self->SUPER::start_element($el);
    }
}

My results:

macbookseb:perl seb$ perl -v
This is perl 5, version 22, subversion 1 (v5.22.1) built for darwin-thread-multi-2level[...]
macbookseb:perl seb$ perl test_sax.pl data.xml XML::SAX::PurePerl
parser is XML::SAX::PurePerl (0.99)
document \ {}
macbookseb:perl seb$ perl test_sax.pl data.xml XML::SAX::Expat
parser is XML::SAX::Expat (0.51)
document \ { Encoding "UTF-8", Standalone "", Version 1.0 }
macbookseb:perl seb$ perl test_sax.pl data.xml XML::LibXML::SAX
parser is XML::LibXML::SAX (2.0124)
document \ {}
decl \ { Version 1.0 }
macbookseb:perl seb$ perl test_sax.pl data.xml XML::LibXML::SAX::Parser
parser is XML::LibXML::SAX::Parser (2.0124)
document \ {}
decl \ { Encoding "UTF-8", Version 1.0 }

Re^3: XML::SAX::ParserFactory policy and differences between parser implementations
by beech (Parson) on Mar 02, 2016 at 02:38 UTC

    Also I do not get the same results as you with my test program and data.

    What do you get with my program?

    update:

    Could you check for what file XML::LibXML::SAX manages to give you an encoding?

    When it's not utf-8, e.g. when it's encoding="ISO-8859-1".

      With your code and my data.xml:
      XML::SAX::PurePerl
      doc ()
      XML::SAX::Expat
      doc ("Standalone", "", "Encoding", "UTF-8", "Version", "1.0")
      XML::SAX::ExpatXS
      doc ()
      decl ("Encoding", "UTF-8", "Version", "1.0", "Standalone", undef)
      XML::LibXML::SAX::Parser
      doc ()
      decl ("Version", "1.0", "Encoding", "UTF-8")
      XML::LibXML::SAX
      doc ()
      decl ("Version", "1.0")
      It agrees with my own test: XML::LibXML::SAX does not get the encoding while XML::LibXML::SAX::Parser does.
      It seems that the parser must be carefully and explicitly selected to get consistent results, rather than letting the factory pick a broken parser.
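      A rough sketch of such an explicit selection, assuming a preference list made of the two parsers that reported the encoding in xml_decl() in the results above. `first_available` is a hypothetical helper, not part of XML::SAX; `$XML::SAX::ParserPackage` is the same variable my test script sets from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first package of the list that can be
# loaded on this machine, or undef if none of them can.
sub first_available {
    my @candidates = @_;
    for my $pkg (@candidates) {
        ( my $file = "$pkg.pm" ) =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
        return $pkg if eval { require $file; 1 };
    }
    return;
}

# Assumed preference order, based only on the results in this thread.
my @preferred = qw(XML::LibXML::SAX::Parser XML::SAX::ExpatXS);
my $pkg = first_available(@preferred);

if ( $pkg && eval { require XML::SAX::ParserFactory; 1 } ) {
    no warnings 'once';
    # Force the factory to hand out exactly this implementation.
    local $XML::SAX::ParserPackage = $pkg;
    my $parser = XML::SAX::ParserFactory->parser();
    print "parser is ", ref $parser, "\n";
}
else {
    print "none of the preferred parsers is installed\n";
}
```

      Pinning the package this way gives up the portability the factory is meant to provide, but at least the events are predictable.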

      Update: indeed, it seems that XML::LibXML::SAX fails to give the encoding of a utf-8 encoded file while it succeeds with an iso-8859-1 one. I have no XML file in another encoding at hand right now for a further test.