rellaboyina has asked for the wisdom of the Perl Monks concerning the following question:

Dear All,
I have some data stored in XML format that needs to be parsed with the XML::Parser and XML::Parser::Expat modules.
The data contains special characters (accented letters, as in the sample record below). When I try to parse a record containing these characters with the parse() method, I get the error "not well-formed (invalid token)". Here is my code:
sub parse {
    my $self = shift;
    my $arg  = shift;
    my @expat_options = ();
    my ($key, $val);
    while (($key, $val) = each %{$self}) {
        push(@expat_options, $key, $val)
            unless exists $self->{Non_Expat_Options}->{$key};
    }

    my $expat = new XML::Parser::Expat(@expat_options, @_);
    my %handlers = %{$self->{Handlers}};
    my $init  = delete $handlers{Init};
    my $final = delete $handlers{Final};

    $expat->setHandlers(%handlers);

    if ($self->{Base}) {
        $expat->base($self->{Base});
    }

    &$init($expat) if defined($init);

    my @result = ();
    my $result;
    eval {
        $result = $expat->parse($arg);
    };
    my $err = $@;
    if ($err) {
        $expat->release;
        die $err;
    }

    if ($result and defined($final)) {
        if (wantarray) {
            @result = &$final($expat);
        }
        else {
            $result = &$final($expat);
        }
    }

    $expat->release;

    return unless defined wantarray;
    return wantarray ? @result : $result;
}

where $arg contains the XML data to be parsed, including the special characters. The XML data looks like this:
<record>
  <source-app>ABC</source-app>
  <ref-type>6</ref-type>
  <contributors>
    <authors>
      <author>
        <style face="normal" font="default" size="100%">Dvoøák, Petr</style>
      </author>
    </authors>
  </contributors>
  <titles>
    <title>
      <style face="normal" font="default" size="100%">Systematická teologie I : ø*mskokatolická perspektiva</style>
    </title>
  </titles>
  <pages>
    <style>285 s.</style>
  </pages>
  <edition>
    <style>1. vyd.</style>
  </edition>
  <keywords>
    <keyword>
      <style>uèen* katolické c*rkve</style>
    </keyword>
  </keywords>
  <dates>
    <year>
      <style>1996</style>
    </year>
  </dates>
  <pub-location>
    <style>Brno&#xD;Praha</style>
  </pub-location>
  <publisher>
    <style>Centrum pro studium demokracie a kultury ;&#xD;Èeská køesanská akademie</style>
  </publisher>
  <notes>
    <style>uspoøádali Francis S. Fiorenza a John P. Galvin ; [z angliètiny pøeložili Petr Dvoøák ... et al.]&#xD;20 cm&#xD;Pozn.&#xD;Pozn. o autorech traktátù&#xD;Zkratky&#xD;Bibliogr.&#xD;Odkazy na lit.&#xD;Jmenný a vìcný rejstø*k</style>
  </notes>
</record>

The data inside the tags of the above XML is Czech. My problem is that Parser.pm is not able to parse these characters. Could anyone please help me solve this? Thanks a lot.

Replies are listed 'Best First'.
Re: Need help in parsing the special characters using XML::Parser
by mirod (Canon) on Nov 19, 2007 at 09:51 UTC

    I did not look at the code, but the problem most likely comes from the data, which is probably in ISO-8859-2 (aka Latin 2).

    If no encoding is specified, XML::Parser expects the data to be in UTF-8. The proper way to declare a different encoding is to include this information in the document itself, by starting it with the XML declaration:

    <?xml version="1.0" encoding="ISO-8859-2"?>
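
    As a minimal sketch of what the declaration changes, the snippet below parses a tiny, hypothetical one-element document whose bytes are ISO-8859-2. The sample data, the variable names, and the Char handler are illustrative assumptions, not the original poster's code. It also assumes the ISO-8859-2 encoding map that is distributed with XML::Parser is installed.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample: "Dvořák" as ISO-8859-2 bytes (ř is 0xF8, á is 0xE1).
# The XML declaration tells the parser how to interpret those bytes.
my $latin2_xml = qq{<?xml version="1.0" encoding="ISO-8859-2"?>\n}
               . qq{<author>Dvo\xF8\xE1k</author>};

# Collect character data; the handler may be called more than once per
# element, so append rather than assign.
my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ($expat, $chars) = @_; $text .= $chars },
    },
);
$parser->parse($latin2_xml);

# $text now holds Perl characters, so pick an output encoding explicitly.
binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

    Without the first line of the document, the same bytes would be read as UTF-8, where 0xF8 is not a valid start of a character, which is consistent with the "not well-formed (invalid token)" error.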

    You could also pre-process the XML, using iconv for example, to convert it to UTF-8. Another way, only if there is really no way for you to change the XML, would be to use the ProtocolEncoding option in XML::Parser.
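
    Both alternatives can be sketched in Perl as follows. This is an illustration under assumptions: the sample document and the helper name text_of are hypothetical, Encode's from_to stands in for an external iconv run, and the ProtocolEncoding call relies on the ISO-8859-2 encoding map shipped with XML::Parser.

```perl
use strict;
use warnings;
use Encode qw(from_to);
use XML::Parser;

# Hypothetical sample with no XML declaration: "Dvořák" as ISO-8859-2 bytes.
my $latin2_xml = "<author>Dvo\xF8\xE1k</author>";

# Hypothetical helper: parse a document and return its character data.
sub text_of {
    my ($xml, %options) = @_;
    my $text = '';
    my $parser = XML::Parser->new(
        %options,
        Handlers => { Char => sub { $text .= $_[1] } },
    );
    $parser->parse($xml);
    return $text;
}

# Option 1: convert a copy of the bytes to UTF-8 first, as iconv would.
my $utf8_xml = $latin2_xml;
from_to($utf8_xml, 'iso-8859-2', 'UTF-8');   # converts in place
my $converted = text_of($utf8_xml);

# Option 2: leave the bytes alone and tell the parser what they are.
my $declared = text_of($latin2_xml, ProtocolEncoding => 'ISO-8859-2');
```

    Both calls should yield the same string. Option 1 changes the data once and needs no parser option afterwards; option 2 leaves the files untouched but has to be repeated wherever they are parsed.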

    Bear in mind that all the data that your handlers will receive from the parser will be UTF-8, so if you want to get it back in Latin 2 you will have to convert it back, using Encode, or iconv for example.
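
    A minimal sketch of the conversion back, using the core Encode module. The input string is a hypothetical stand-in for a value received in a handler; it assumes the parser hands over Perl character strings, as current XML::Parser releases do.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical value as a handler might receive it: "Dvořák" as Perl
# characters (ř is U+0159, á is U+00E1).
my $from_parser = "Dvo\x{159}\x{E1}k";

# Encode the characters as ISO-8859-2 bytes for storage or output.
my $latin2_bytes = encode('iso-8859-2', $from_parser);

# Show the resulting bytes in hex: ř becomes f8 and á becomes e1.
print join(' ', map { sprintf '%02x', ord } split //, $latin2_bytes), "\n";
```

    Characters that have no ISO-8859-2 equivalent are replaced with a substitution character by default; Encode's CHECK argument can be used to make that an error instead.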

      Great resource, thanks for providing it!
Re: Need help in parsing the special characters using XML::Parser
by siva kumar (Pilgrim) on Nov 19, 2007 at 08:58 UTC
    Have a look at this post; it may be very useful for you:
    http://www.perlmonks.org/?node_id=404658