in reply to HTML::TokeParser, get_text scrambling rsquo and lsquo

It might be easier to help/try/check if you could provide a fully runnable code snippet, sample data included.

Without it, anyone trying to help will have to build a test case on their own, and might fail to find one that matches your problem description.

Help us to help you ;)

Re^2: HTML::TokeParser, get_text scrambling rsquo and lsquo
by tridral (Initiate) on May 12, 2007 at 08:45 UTC
    Thank you for your reply. I've written a short program to show the problem:

    use HTML::TokeParser;
    use strict;

    local $/;
    my $lines = <DATA>;

    my $tok_par = HTML::TokeParser->new(\$lines);

    my $tok_inf = $tok_par->get_token;
    my $tok_typ = shift @{$tok_inf};
    print "Type: $tok_typ \n";

    my $title = $tok_par->get_text() || "<NO TITLE FOUND>";
    print "Title: $title \n";

    __END__
    <title>egrave: &egrave; : eacute: &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>

    I've now tested this at home, and with my web host. At home it works as it should:

    Title: egrave: è : eacute: é : rsquo: ’ : lsquo: ‘

    At the web host it produces the results previously described:

    Title: egrave: è : eacute: é : rsquo: ’ : lsquo: ‘

    In case it makes a difference, at home I have:

    This is perl, v5.8.8 built for i586-linux-thread-multi

    and the web host has:

    This is perl, v5.8.5 built for i386-linux-thread-multi

    Do you know if this behaviour is a difference between 5.8.5 and 5.8.8? Thank you for any further advice!

      Check the LANG setting on your server and at home; I bet they are different. Or it may be that the web server's charset defaults to UTF-8. Check the config.
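For comparing the two machines, something like the following might help. It is only a minimal sketch: it prints the locale-related environment variables and the character set that the C library reports for the current locale. The helper name `locale_report` is made up for this example, and the web server's own charset configuration is not covered here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use I18N::Langinfo qw(langinfo CODESET);
use POSIX qw(setlocale LC_CTYPE);

# Hypothetical helper: gather the settings that usually decide how
# non-ASCII output ends up being displayed.
sub locale_report {
    my %report;
    for my $name (qw(LANG LC_ALL LC_CTYPE)) {
        $report{$name} = defined $ENV{$name} ? $ENV{$name} : '(unset)';
    }
    # Adopt the locale named by the environment, then ask for its charset.
    setlocale(LC_CTYPE, '');
    $report{CODESET} = langinfo(CODESET);
    return %report;
}

my %report = locale_report();
print "$_=$report{$_}\n" for sort keys %report;
```

Run it once at home and once on the web host and compare the CODESET lines. If one reports UTF-8 and the other a single-byte charset, that difference is a likely suspect.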

      --shmem

      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
        Thank you. At home I have

        LANG=en_GB.UTF-8

        The server has

        LANG=en_US

        I'm not sure what to do with this information and whether it is significant.

        Can you give me a pointer to the web server's configuration file? (Or should I ask the web host people about this?)

      As you mentioned that you just want to somehow process the incoming html and then pass it forward for other processing, it might be that something like the code below could help.

      That is, use UTF-8 for what goes out to STDOUT and re-encode the title that was decoded by get_text:
      #!/usr/bin/perl
      use strict;
      use warnings;
      use HTML::Entities;
      use HTML::TokeParser;

      # Emit wide characters as UTF-8 on STDOUT
      binmode(STDOUT, ":utf8");

      local $/;
      my $lines = <DATA>;

      my $tok_par = HTML::TokeParser->new( \$lines );

      my $tok_inf = $tok_par->get_token;
      my $tok_typ = shift @{$tok_inf};
      print "Type: $tok_typ \n";

      my $title = $tok_par->get_text() || "<NO TITLE FOUND>";
      print "Title: $title \n";

      # Turn characters from U+00FF upwards back into entities
      my $encoded_title = encode_entities( $title, '\x{ff}-\x{ffff}' );
      print "Enc_Title: $encoded_title\n";

      __END__
      <title>egrave: &egrave; : eacute: &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
      print "From HTML::Entities lsquo => chr(8216), rsquo => chr(8217)";
      print "\n";
      print "From HTML::Entities lsquo => ",chr(8216),", rsquo => ", chr(8217);
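To see why the output layer matters, it may help to look at the bytes involved. The sketch below assumes (the thread does not confirm this) that the scrambling comes from the curly quotes being written as multi-byte UTF-8 sequences to something that expects a single-byte charset. The helper `utf8_hex` is a name invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 bytes of a code point as space-separated hex.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

printf "egrave U+%04X => %s\n", 232,  utf8_hex(232);    # egrave U+00E8 => C3 A8
printf "lsquo  U+%04X => %s\n", 8216, utf8_hex(8216);   # lsquo  U+2018 => E2 80 98
printf "rsquo  U+%04X => %s\n", 8217, utf8_hex(8217);   # rsquo  U+2019 => E2 80 99
```

Each quote becomes three bytes in UTF-8. A viewer that reads those bytes one at a time in a single-byte charset shows three unrelated characters per quote. Note also that è and é fit in Latin-1 while the quotes do not, which could be why only rsquo and lsquo look wrong.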
        This is interesting. This code works the same at home and on the web site. I get the proper quotes both times.