Re: HTML::TokeParser, get_text scrambling rsquo and lsquo

Re^2: HTML::TokeParser, get_text scrambling rsquo and lsquo by tridral (Initiate) on May 12, 2007 at 08:45 UTC
Thank you for your reply. I've written a short program to show the problem:

```perl
use HTML::TokeParser;
use strict;

local $/;
my $lines = <DATA>;

my $tok_par = HTML::TokeParser->new(\$lines);
my $tok_inf = $tok_par->get_token;
my $tok_typ = shift @{$tok_inf};
print "Type: $tok_typ \n";

my $title = $tok_par->get_text() || "<NO TITLE FOUND>";
print "Title: $title \n";

__END__
<title>egrave: &egrave; : eacute: &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
```

I've now tested this at home and with my web host. At home it works as it should:

Title: egrave: è : eacute: é : rsquo: ’ : lsquo: ‘

At the web host it produces the results previously described:

Title: egrave: è : eacute: é : rsquo: â€™ : lsquo: â€˜

In case it makes a difference, at home I have:

This is perl, v5.8.8 built for i586-linux-thread-multi

and the web host has:

This is perl, v5.8.5 built for i386-linux-thread-multi

Do you know if this behaviour is a difference between 5.8.5 and 5.8.8? Thank you for any further advice!
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo by shmem (Chancellor) on May 12, 2007 at 09:51 UTC
Check the LANG setting on your server and at home; I bet they are different. Or it may be that the web server's charset defaults to utf-8. Check the config. --shmem
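The garbled sequences reported earlier in the thread are consistent with UTF-8 output being displayed as a single-byte encoding. The following is only a minimal sketch, assuming the display side uses Windows-1252 (the thread does not say which charset the web host's display actually uses); `misread_utf8_as_cp1252` is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: encode a character string as UTF-8 bytes, then
# decode those same bytes as Windows-1252, the way a viewer with the
# wrong charset assumption would.
sub misread_utf8_as_cp1252 {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return decode('cp1252', $bytes);
}

# Show, as code points, what rsquo (U+2019) and lsquo (U+2018) turn into.
for my $cp (0x2019, 0x2018) {
    my $garbled = misread_utf8_as_cp1252(chr $cp);
    my @seen    = map { sprintf 'U+%04X', ord } split //, $garbled;
    printf "U+%04X -> %s\n", $cp, join ' ', @seen;
}
```

Under that assumption, each curly quote becomes three characters starting with U+00E2 (â) and U+20AC (€), which matches the "â€" prefix in the web host output.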
Re^4: HTML::TokeParser, get_text scrambling rsquo and lsquo by tridral (Initiate) on May 12, 2007 at 10:06 UTC
Thank you. At home I have:

LANG=en_GB.UTF-8

The server has:

LANG=en_US

I'm not sure what to do with this information or whether it is significant. Can you give me a pointer to the web server's configuration file? (Or should I ask the web host people about this?)
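One way to read those two values: the home locale names a UTF-8 codeset, while the server locale names none. A rough sketch of checking this from inside a script follows, assuming the LANG string is a good enough indicator of what the terminal expects (for CGI output the HTTP charset matters instead, which this does not inspect); `lang_is_utf8` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if a LANG-style locale string names a
# UTF-8 codeset (e.g. "en_GB.UTF-8" or "en_GB.utf8"), otherwise 0.
sub lang_is_utf8 {
    my ($lang) = @_;
    return (defined($lang) && $lang =~ /\.utf-?8\b/i) ? 1 : 0;
}

# The two values quoted in the thread, plus whatever this process sees.
for my $lang ('en_GB.UTF-8', 'en_US', $ENV{LANG}) {
    printf "%-12s => %s\n",
        defined $lang ? $lang : '(unset)',
        lang_is_utf8($lang) ? 'UTF-8 codeset' : 'no UTF-8 codeset named';
}
```

A script could use such a check to decide whether to print UTF-8 directly or fall back to entities, though setting the output encoding explicitly is usually more predictable than guessing from the environment.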
Re^5: HTML::TokeParser, get_text scrambling rsquo and lsquo by shmem (Chancellor) on May 12, 2007 at 10:49 UTC
Re^6: HTML::TokeParser, get_text scrambling rsquo and lsquo by Joost (Canon) on May 12, 2007 at 11:05 UTC
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo by Krambambuli (Curate) on May 12, 2007 at 14:05 UTC
As you mentioned that you just want to process the incoming HTML and then pass it on for further processing, something like the code below might help. It uses UTF-8 for what goes out to STDOUT and re-encodes the title that was decoded by get_text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Entities;
use HTML::TokeParser;

binmode(STDOUT, ":utf8");

local $/;
my $lines = <DATA>;

my $tok_par = HTML::TokeParser->new( \$lines );
my $tok_inf = $tok_par->get_token;
my $tok_typ = shift @{$tok_inf};
print "Type: $tok_typ \n";

my $title = $tok_par->get_text() || "<NO TITLE FOUND>";
print "Title: $title \n";

my $encoded_title = encode_entities( $title, '\x{ff}-\x{ffff}' );
print "Enc_Title: $encoded_title\n";

__END__
<title>egrave: &egrave; : eacute: &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
```
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo by Anonymous Monk on May 12, 2007 at 09:43 UTC
```perl
print "From HTML::Entities lsquo => chr(8216), rsquo => chr(8217)";
print "\n";
print "From HTML::Entities lsquo => ",chr(8216),", rsquo => ", chr(8217);
```
Re^4: HTML::TokeParser, get_text scrambling rsquo and lsquo by tridral (Initiate) on May 12, 2007 at 10:02 UTC
This is interesting. This code works the same at home and on the web site. I get the proper quotes both times.