EDevil has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I have two similar machines running the same OS (Debian Sarge), and both use Perl from the deb packages. So everything should work the same, but I get different results regarding UTF-8 encoding. Running this script on the two machines does not give the same result, and I don't know why:
andre@blogs-dev:~$ perl -e 'use Encode; use LWP::Simple; use Data::Dumper; my $html = get("http://ljsapo.blogspot.com/feeds/posts/full"); use XML::Simple; my $parsed = XMLin($html); my $text = $parsed->{"entry"}->{"tag:blogger.com,1999:blog-20104370.post-115522283622549946"}->{"content"}->{"content"}; print Dumper(Encode::is_utf8($text));'
$VAR1 = '';
andre@blogs-dev:~$
andre.cruz@blogs1:~$ perl -e 'use Encode; use LWP::Simple; use Data::Dumper; my $html = get("http://ljsapo.blogspot.com/feeds/posts/full"); use XML::Simple; my $parsed = XMLin($html); my $text = $parsed->{"entry"}->{"tag:blogger.com,1999:blog-20104370.post-115522283622549946"}->{"content"}->{"content"}; print Dumper(Encode::is_utf8($text));'
$VAR1 = '1';
andre.cruz@blogs1:~$
They both use the same Perl package version:
andre@blogs-dev:~$ perl -v
This is perl, v5.8.4 built for i386-linux-thread-multi
Copyright 1987-2004, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.
andre@blogs-dev:~$
Also they have the same locale settings:
andre@blogs-dev:~$ locale
LANG=POSIX
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
andre@blogs-dev:~$
This script just fetches a known feed URL and illustrates my problem on one of the XML nodes. The string the XML parser returns has the UTF-8 flag on, but only on one of the machines. Does anyone know what the difference between these two machines is?

Replies are listed 'Best First'.
Re: Encoding differences
by almut (Canon) on Feb 26, 2007 at 16:33 UTC

    You say the machines are similar (but not identical), so you might want to check whether you're using the exact same libraries/versions. Just as an idea, XML::Simple can work with either XML::Parser or XML::SAX (which in turn can use different SAX parsers). Also, XML::Parser would depend on expat, of which you might have different versions. (ldd and strace are your friends here, if you can't find out otherwise...)
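
A minimal sketch of one way to run that check from Perl itself, assuming it is enough to look at which XML-related modules ended up in %INC after XMLin() has run. The helper name loaded_xml_backends is made up for illustration, and the pattern only covers the XML::Parser, XML::SAX and XML::LibXML families:

```perl
use strict;
use warnings;

# Hypothetical helper: given a hash ref shaped like %INC, return the sorted
# names of loaded modules that look like XML parser backends.
sub loaded_xml_backends {
    my ($inc) = @_;
    my @mods = map {
        (my $m = $_) =~ s{/}{::}g;   # "XML/SAX/Expat.pm" -> "XML::SAX::Expat.pm"
        $m =~ s{\.pm$}{};            # drop the file extension
        $m;
    } grep { m{^XML/(?:Parser|SAX|LibXML)\b} } keys %$inc;
    return sort @mods;
}

# Run this on each machine *after* XMLin(), so the parser has been loaded.
# With no XML modules loaded (as in this standalone sketch) it prints nothing.
for my $mod (loaded_xml_backends(\%INC)) {
    my $version = $mod->VERSION;
    printf "%s %s\n", $mod, defined $version ? $version : '(no version)';
}
```

If the two lists differ, XML::Simple's documentation describes a $XML::Simple::PREFERRED_PARSER variable that can be set to force the same parser on both machines.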

    Another thing to check is whether LWP::Simple is actually fetching the same XML. Maybe, for some reason, it's sending a different Accept-Charset: header along with the request, which the server honors by returning a different encoding (I wouldn't know why exactly... but who knows). Just dump and compare the contents of your $html variable. If they're identical, you've narrowed things down to some later processing step... Well, you get the idea.
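
One way to compare the fetched content without eyeballing two long dumps is to print a short fingerprint on each machine. This is only a sketch: fingerprint is a hypothetical helper, and it assumes that comparing the UTF-8 flag, the byte length and an MD5 digest is enough to tell whether both machines received the same data.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode is_utf8);

# Hypothetical helper: summarise a string as "flag, byte length, MD5".
sub fingerprint {
    my ($data) = @_;
    my $flag = is_utf8($data) ? 1 : 0;
    # md5_hex() croaks on wide characters, so a flagged string is
    # serialised to UTF-8 octets before hashing.
    my $bytes = $flag ? encode('UTF-8', $data) : $data;
    return sprintf 'utf8_flag=%d bytes=%d md5=%s',
        $flag, length($bytes), md5_hex($bytes);
}

# In the original one-liner this would be: print fingerprint($html), "\n";
print fingerprint('abc'), "\n";
```

If both machines print the same line for $html, the difference has to come from a later step such as the XML parsing.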

      It turns out that XML::Simple was indeed using different XML parsers on the two machines. XML::Simple was the same version on both, but I had forgotten that it uses other modules underneath. Thanks!
Re: Encoding differences
by ikegami (Patriarch) on Feb 26, 2007 at 16:26 UTC

    I'd be interested in seeing the output if you added

    print Dumper($text); print XML::Simple->VERSION;
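
For context on why the dumped string matters: the same text can be held either as raw UTF-8 octets (flag off) or as decoded characters (flag on), so the flag alone does not say which machine is "right". A small sketch with a made-up sample string, using only the Encode module:

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

# Sample data for illustration only: "café" as raw UTF-8 octets.
my $bytes = "caf\xc3\xa9";
# The same text decoded into Perl characters.
my $chars = decode('UTF-8', $bytes);

printf "bytes: flag=%d length=%d\n", (is_utf8($bytes) ? 1 : 0), length($bytes);
printf "chars: flag=%d length=%d\n", (is_utf8($chars) ? 1 : 0), length($chars);

# Encoding the character string again gives back the original octets.
print "round trip ok\n" if encode('UTF-8', $chars) eq $bytes;
```

A parser that hands back octets and one that hands back characters would produce exactly this kind of difference in is_utf8().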