Sixtease has asked for the wisdom of the Perl Monks concerning the following question:

Hello.

I keep getting segfaults when I attempt to parse an XML file containing Czech diacritic characters encoded in UTF-8. I experienced this no matter what parser package I used, except for PurePerl. The segfault only happens for XML files above 2 KB or so... I'm including the code that causes the fault and a link to the minimal XML file giving me the error (if I delete a line, it runs normally).

This has been happening since roughly July, and I've been working around it by using PurePerl, which now seems to have a bug in it, so I decided to ask for help on this matter first.

My perl and machine are: v5.8.8 built for x86_64-linux-thread-multi
Gentoo Linux for amd64 on Core2 Duo, Kernel 2.6.19 with Gentoo patches.

The XML file

#!/usr/bin/perl

{
    package Handler;
    use strict;
    use warnings;
    use encoding 'utf8';
    sub new { bless +{} }
}

use strict;
use warnings;
use encoding 'utf8';
use XML::SAX::ParserFactory;

$XML::SAX::ParserPackage = "XML::LibXML::SAX";

open (my $file, '<:encoding(utf8)', 'train.m.xml');
my $parser = XML::SAX::ParserFactory->parser( "Handler" => Handler->new() );
$parser->parse_file($file);

Replies are listed 'Best First'.
Re: XML::SAX UTF-8 segfault
by Khen1950fx (Canon) on Jan 26, 2007 at 12:21 UTC
    As I understand it, utf8 is the default for SAX for 5.8 and later; so I don't think the encoding is the problem. I ran your script, and I got a few "read" errors for XML::LibXML and XML::LibXML::SAX. They weren't loaded. I played with it and got this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::SAX::ParserFactory;
    use XML::LibXML::SAX;

    $XML::SAX::ParserPackage = "XML::LibXML::SAX";

    my $handler = XML::LibXML::SAX->new();
    my $p = XML::SAX::ParserFactory->parser(Handler => $handler);
    $p->parse_uri("http://junk.sixtease.net/train.m.xml");
      Wow, after working through the differences between your code and mine... I found that I get a segfault when I open the XML file with '<:encoding(utf8)' and it runs normally when I open it with '<'.

      I don't have a clue what's going on here and I think it's pretty sick :-) but I am more grateful than I can say.
      Thank you friend.
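
For anyone wondering what actually differs between the two open modes mentioned above, here is a minimal sketch using only core modules. It does not reproduce the segfault and does not claim to explain it; it only shows that a handle opened with '<:encoding(utf8)' hands back already-decoded character strings, while a raw handle hands back the undecoded bytes that an XML parser would normally decode itself from the XML declaration. The sample document, the temporary file, and the slurp helper are all hypothetical, made up for this illustration. '<:raw' is used in place of the plain '<' from the thread to rule out any default layers.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample document. The bytes are spelled out explicitly so the
# script does not depend on the encoding of its own source file.
# "\xC5\x99" is the UTF-8 encoding of U+0159 (r with caron).
my $xml_bytes = qq{<?xml version="1.0" encoding="UTF-8"?>\n<w>\xC5\x99</w>\n};

# Write the bytes to a temporary file without any encoding layer.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} $xml_bytes;
close $tmp or die "close failed: $!";

# Hypothetical helper: read the whole file through the given open mode.
sub slurp {
    my ($mode) = @_;
    open(my $fh, $mode, $path) or die "Cannot open $path: $!";
    local $/;                 # slurp mode: read everything at once
    my $data = <$fh>;
    close $fh;
    return $data;
}

my $raw     = slurp('<:raw');             # undecoded bytes
my $decoded = slurp('<:encoding(utf8)');  # characters, decoded by PerlIO

printf "raw:     length %d, utf8 flag %s\n",
    length($raw), utf8::is_utf8($raw) ? 'on' : 'off';
printf "decoded: length %d, utf8 flag %s\n",
    length($decoded), utf8::is_utf8($decoded) ? 'on' : 'off';
```

The two strings differ in length because the two-byte sequence for the accented letter becomes a single character once decoded. Whether that difference is what trips up the parser here is only a guess, but passing the parser a handle without an encoding layer matches the observation in this thread that plain '<' works.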

Re: XML::SAX UTF-8 segfault
by Anonymous Monk on Jan 26, 2007 at 07:10 UTC
    What LibXML do you have?