in reply to Re: HTML::TokeParser, get_text scrambling rsquo and lsquo
in thread HTML::TokeParser, get_text scrambling rsquo and lsquo

Thank you for your reply. I've written a short program to show the problem:

use HTML::TokeParser;
use strict;

local $/;
my $lines = <DATA>;
my $tok_par = HTML::TokeParser->new(\$lines);
my $tok_inf = $tok_par->get_token;
my $tok_typ = shift @{$tok_inf};
print "Type: $tok_typ \n";
my $title = $tok_par->get_text() || "<NO TITLE FOUND>";
print "Title: $title \n";
__END__
<title>egrave: &egrave; : eacute: &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>

I've now tested this at home, and with my web host. At home it works as it should:

Title: egrave: è : eacute: é : rsquo: ’ : lsquo: ‘

At the web host it produces the results previously described:

Title: egrave: è : eacute: é : rsquo: ’ : lsquo: ‘

In case it makes a difference, at home I have:

This is perl, v5.8.8 built for i586-linux-thread-multi

and the web host has:

This is perl, v5.8.5 built for i386-linux-thread-multi

Do you know if this behaviour is a difference between 5.8.5 and 5.8.8? Thank you for any further advice!

Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo
by shmem (Chancellor) on May 12, 2007 at 09:51 UTC
    Check the LANG setting at your server and at home, I bet they are different. Or it is that the web server's charset defaults to utf-8. Check the config.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      Thank you. At home I have

      LANG=en_GB.UTF-8

      The server has

      LANG=en_US

      I'm not sure what to do with this information and whether it is significant.

      Can you give me a pointer to the web server's configuration file, or should I ask the web host people about this?
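One way to see what the two LANG values mean in practice is to ask each machine which character set its locale selects. The following is a minimal sketch, not part of the original posts, that uses only the core modules POSIX and I18N::Langinfo; the variable names `$codeset` and `$lang` are illustrative. Running it at home and on the web host should show whether the two environments expect different output encodings.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Adopt whatever locale the environment (LANG, LC_ALL, LC_CTYPE) requests.
setlocale(LC_CTYPE, '');

# Name of the character set that locale uses, e.g. "UTF-8" or "ISO-8859-1".
my $codeset = langinfo(CODESET);
my $lang    = defined $ENV{LANG} ? $ENV{LANG} : '(unset)';

print "LANG=$lang, codeset=$codeset\n";
```

A locale such as en_GB.UTF-8 would be expected to report a UTF-8 codeset, while a bare en_US typically selects a single-byte character set; the exact names depend on the system.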

        Ah, so it is the other way round, and very odd indeed. Running your code I get the same result, and I don't have UTF-8 in my LANG setting, either. It seems that HTML::TokeParser turns on the UTF-8 flag on strings returned by the get_text() method:
        #!/usr/bin/perl
        use HTML::TokeParser;
        #use Data::Dump::Streamer;
        use strict;
        use Devel::Peek;

        local $/;
        my $lines = <DATA>;
        my $tok_par = HTML::TokeParser->new(\$lines);
        my $tok_inf = $tok_par->get_token;
        my $tok_typ = shift @{$tok_inf};
        my $title = $tok_par->get_text() || "<NO TITLE FOUND>";
        Dump ($title);
        __DATA__
        <title>egrave: &egrave; : eacute: &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
        __END__
        SV = PV(0x81b4290) at 0x81ed950
          REFCNT = 1
          FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
          PV = 0x8207b90 "egrave: \303\250 : eacute: \303\251 : rsquo: \342\200\231 : lsquo: \342\200\230"\0 [UTF8 "egrave: \x{e8} : eacute: \x{e9} : rsquo: \x{2019} : lsquo: \x{2018}"]
          CUR = 49
          LEN = 52

        - that's why you see the right output on your UTF-8 terminal at home, but garbled stuff on the server's terminal.

        Hmm. I call that a bug :-)

        --shmem
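The diagnosis above (a character string with the UTF8 flag set, printed to a terminal that may not expect UTF-8) can be checked without HTML::TokeParser. This is a minimal sketch under stated assumptions: `$title` is a hand-built stand-in for the value get_text() returned in the thread, and the explicit `encode` call is one possible way to control the output bytes, not something the posters used.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for what get_text() returned in the thread: a character string
# containing U+2019 (rsquo) and U+2018 (lsquo).
my $title = "rsquo: " . chr(0x2019) . " : lsquo: " . chr(0x2018);

# A string holding code points above 0xFF always carries the UTF8 flag.
my $flagged = utf8::is_utf8($title) ? 1 : 0;

# Encode explicitly so the bytes written do not depend on perl's defaults.
my $bytes = encode('UTF-8', $title);

printf "UTF8 flag: %d, characters: %d, bytes: %d\n",
    $flagged, length($title), length($bytes);
print $bytes, "\n";
```

Each quote is one character but three bytes once encoded, which is why a terminal that reads the output one byte per character shows several garbled symbols per quote.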

Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo
by Krambambuli (Curate) on May 12, 2007 at 14:05 UTC
    As you mentioned that you just want to somehow process the incoming html and then pass it forward for other processing, it might be that something like the code below could help.

    That is, use UTF-8 for what goes out to STDOUT and re-encode the title that was decoded by get_text:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Entities;
    use HTML::TokeParser;

    binmode(STDOUT, ":utf8");

    local $/;
    my $lines = <DATA>;
    my $tok_par = HTML::TokeParser->new( \$lines );
    my $tok_inf = $tok_par->get_token;
    my $tok_typ = shift @{$tok_inf};
    print "Type: $tok_typ \n";
    my $title = $tok_par->get_text() || "<NO TITLE FOUND>";
    print "Title: $title \n";
    my $encoded_title = encode_entities( $title, '\x{ff}-\x{ffff}' );
    print "Enc_Title: $encoded_title\n";
    __END__
    <title>egrave: &egrave; : eacute: &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo
by Anonymous Monk on May 12, 2007 at 09:43 UTC
    print "From HTML::Entities lsquo => chr(8216), rsquo => chr(8217)";
    print "\n";
    print "From HTML::Entities lsquo => ",chr(8216),", rsquo => ", chr(8217);
      This is interesting. This code works the same at home and on the web site. I get the proper quotes both times.