HTML::TokeParser, get_text scrambling rsquo and lsquo

tridral has asked for the wisdom of the Perl Monks concerning the following question:

Re: HTML::TokeParser, get_text scrambling rsquo and lsquo by Joost (Canon) on May 11, 2007 at 19:01 UTC
I suspect that the lsquo and rsquo get encoded as UTF-8 characters, since they're not part of Perl's default one-byte encoding (Latin-1). That means each of them will be more than one byte long. How you need to handle that depends on your desired output encoding. See perluniintro, especially the section on encoding layers. "What should it profit a man, if he should win a flame war, yet lose his cool?"
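To make the byte-length point concrete, here is a minimal sketch using the core Encode module. The code points are the ones HTML::Entities maps these names to; the hash names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Code points that the named entities decode to.
my %char = (
    eacute => chr(0xE9),
    lsquo  => chr(0x2018),
    rsquo  => chr(0x2019),
);

# Record how many bytes each character needs once encoded as UTF-8.
my %utf8_len;
for my $name (sort keys %char) {
    my $bytes = encode('UTF-8', $char{$name});
    $utf8_len{$name} = length $bytes;
    printf "%-6s U+%04X  fits in Latin-1: %-3s  UTF-8 bytes: %s\n",
        $name,
        ord($char{$name}),
        (ord($char{$name}) <= 0xFF ? 'yes' : 'no'),
        join ' ', map { sprintf '%02X', ord } split //, $bytes;
}
```

The two quote characters lie above U+00FF, so they cannot be written as single Latin-1 bytes, whereas eacute can.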
Re^2: HTML::TokeParser, get_text scrambling rsquo and lsquo by tridral (Initiate) on May 12, 2007 at 10:14 UTC
I think it's a good point about the desired output encoding. I'm only reading the HTML to produce another HTML file, so it would suit me just as well to read the text raw without interpreting the HTML entities. Is there something similar to get_text that just delivers the text without interpreting it first? I've not found anything like that when reading about TokeParser. It may be that the problem can be solved by looking at character encodings as people have suggested, but in case that falls through, I'd also like to look at the possibility of reading uninterpreted HTML. Thanks for your help.
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo by Joost (Canon) on May 12, 2007 at 10:57 UTC
I don't see any method to get the "raw" text either. In any case, if the output encoding doesn't matter, just open the output file in utf8 mode and declare the matching encoding in the HTML file (not needed if the file is on a web server that sends the correct Content-Type header with charset for the file):

```perl
open my $out, ">:utf8", $filename or die $!;
# print head start
print $out q(<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">);
# print rest of head and document
```
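A minimal sketch of what the output layer does, assuming a scratch file created with File::Temp in place of the real output file. It uses the stricter `:encoding(UTF-8)` layer rather than `:utf8`, and the sample string is made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file standing in for the real output file.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh;

# Decoded text as get_text would return it: two characters above U+00FF.
my $title = "rsquo: \x{2019} : lsquo: \x{2018}";

# Write through an encoding layer so characters become UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', $filename
    or die "can't write $filename: $!";
print {$out} $title;
close $out or die "can't close $filename: $!";

# Read the file back without any decoding to count the raw bytes.
open my $in, '<:raw', $filename or die "can't read $filename: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "%d characters were written as %d bytes\n",
    length($title), length($bytes);
```

Without a layer on the handle, printing the same string would typically trigger a "Wide character in print" warning.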
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo by wfsp (Abbot) on May 13, 2007 at 10:13 UTC
"reading... html to produce... html"

I prefer to avoid jiggering about with the encoding too. In my experience it always ends in tears. :-) Here's what I do (I'm using HTML::TokeParser::Simple in this case).

```perl
#!/usr/bin/perl
use strict;
use warnings;

use lib 'lib';
use MyParser;

my ($p, $txt);
my $html = do { local $/; <DATA> };

# parse the sample document held in the DATA section
$p = MyParser->new(string => $html)
    or die "can't parse: $!\n";

$txt = $p->get_title;
print "$txt\n";

$p->get_tag('p');
$txt = $p->get_txt('p'); # up to a closing p tag
print "$txt\n";

__DATA__
<html>
<head>
<title>egrave: &egrave; : eacute : &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
</head>
<body>
<p>
one <span class="second">two</span> three
<br>
four five six
</p>
</body>
</html>
```

MyParser.pm

```perl
package MyParser;
use strict;
use warnings;

use HTML::TokeParser::Simple;
use base qw(HTML::TokeParser::Simple);

# return the text of the title element, entities left untouched
sub get_title {
    my ($self) = @_;
    $self->get_tag('title') or return;
    $self->get_txt('title');
}

# collect raw text up to the closing $tag, skipping any tags on the way
sub get_txt {
    my ($self, $tag) = @_;
    my $txt = '';
    while (my $t = $self->get_token) {
        last if $t->is_end_tag($tag);
        next if $t->is_start_tag or $t->is_end_tag;
        $txt .= $t->as_is if $t->is_text;
    }
    # normalise whitespace
    for ($txt) {
        s/\n/ /g;
        s/^\s+//;
        s/\s+$//;
        s/\s+/ /g;
    }
    return $txt;
}

1;
```

If, like me, you find that getting the title is a frequent task, you can add a method to the wrapper to do that, as I've done here. I've also tried to emulate HTML::TokeParser's `get_trimmed_text`. It skips any tags found before the required end tag (although this won't apply to titles). I'd be interested in hearing comments from monks if any glaring faux pas have been committed. :-)
Re^4: HTML::TokeParser, get_text scrambling rsquo and lsquo by tridral (Initiate) on Jun 05, 2007 at 18:38 UTC
Re: HTML::TokeParser, get_text scrambling rsquo and lsquo by Krambambuli (Curate) on May 11, 2007 at 17:20 UTC
It might be easier to help/try/check if you could provide a fully runnable code snippet, sample data included. Without it, anyone trying to help will have to write one on their own, and might fail to find a case that matches your problem description. Help us to help you ;)
Re^2: HTML::TokeParser, get_text scrambling rsquo and lsquo by tridral (Initiate) on May 12, 2007 at 08:45 UTC
Thank you for your reply. I've written a short program to show the problem:

```perl
use HTML::TokeParser;
use strict;

local $/;
my $lines = <DATA>;

my $tok_par = HTML::TokeParser->new(\$lines);
my $tok_inf = $tok_par->get_token;
my $tok_typ = shift @{$tok_inf};
print "Type: $tok_typ \n";

my $title = $tok_par->get_text() || "<NO TITLE FOUND>";
print "Title: $title \n";

__END__
<title>egrave: &egrave; : eacute: &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
```

I've now tested this at home, and with my web host. At home it works as it should:

Title: egrave: è : eacute: é : rsquo: ’ : lsquo: ‘

At the web host it produces the results previously described:

Title: egrave: è : eacute: é : rsquo: â€™ : lsquo: â€˜

In case it makes a difference, at home I have:

This is perl, v5.8.8 built for i586-linux-thread-multi

and the web host has:

This is perl, v5.8.5 built for i386-linux-thread-multi

Do you know if this behaviour is a difference between 5.8.5 and 5.8.8? Thank you for any further advice!
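One plausible reading of the web host's output is that correct UTF-8 bytes are being displayed as if they were Windows-1252. The sketch below reproduces that effect with the core Encode module. Treating the viewer's charset as cp1252 is an assumption, and `misread_as_cp1252` is a helper name made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: encode a character as UTF-8, then decode those
# bytes with the wrong charset, as a misconfigured viewer would.
sub misread_as_cp1252 {
    my ($char) = @_;
    my $utf8_bytes = encode('UTF-8', $char);
    return decode('cp1252', $utf8_bytes);
}

my %char = (lsquo => chr(0x2018), rsquo => chr(0x2019));

for my $name (sort keys %char) {
    my $garbled = misread_as_cp1252($char{$name});
    print "$name: intended '$char{$name}', misread as '$garbled'\n";
}
```

If this is what is happening, the bytes themselves are fine and only the declared or assumed charset is wrong.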
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo by shmem (Chancellor) on May 12, 2007 at 09:51 UTC
Check the LANG setting on your server and at home; I bet they are different. Or it may be that the web server's charset defaults to UTF-8. Check the config. --shmem
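A small diagnostic sketch for comparing the two machines. It only reports settings and makes no claim about which one causes the difference. Checking LC_ALL and LC_CTYPE alongside LANG is an addition here, not something the reply asked for.

```perl
use strict;
use warnings;

# Locale-related environment variables that commonly differ between hosts.
for my $var (qw(LANG LC_ALL LC_CTYPE)) {
    printf "%-8s = %s\n", $var,
        defined $ENV{$var} ? $ENV{$var} : '(unset)';
}

# PerlIO layers currently applied to STDOUT (e.g. unix, perlio, utf8).
my @layers = PerlIO::get_layers(\*STDOUT);
print "STDOUT layers: @layers\n";

# Interpreter version, for comparison between the two installations.
printf "perl version: %vd\n", $^V;
```

Run it on both machines and compare the output line by line.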
Re^4: HTML::TokeParser, get_text scrambling rsquo and lsquo by tridral (Initiate) on May 12, 2007 at 10:06 UTC
Re^5: HTML::TokeParser, get_text scrambling rsquo and lsquo by shmem (Chancellor) on May 12, 2007 at 10:49 UTC
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo by Krambambuli (Curate) on May 12, 2007 at 14:05 UTC
As you mentioned that you just want to process the incoming HTML and then pass it on for other processing, something like the code below might help. That is, use UTF-8 for what goes out to STDOUT and re-encode the title that was decoded by get_text:

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Entities;
use HTML::TokeParser;

binmode(STDOUT, ":utf8");

local $/;
my $lines = <DATA>;

my $tok_par = HTML::TokeParser->new( \$lines );
my $tok_inf = $tok_par->get_token;
my $tok_typ = shift @{$tok_inf};
print "Type: $tok_typ \n";

my $title = $tok_par->get_text() || "<NO TITLE FOUND>";
print "Title: $title \n";

# turn every character from U+00FF upwards back into an entity
my $encoded_title = encode_entities( $title, '\x{ff}-\x{ffff}' );
print "Enc_Title: $encoded_title\n";

__END__
<title>egrave: &egrave; : eacute: &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
```
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo by Anonymous Monk on May 12, 2007 at 09:43 UTC
```perl
print "From HTML::Entities lsquo => chr(8216), rsquo => chr(8217)";
print "\n";
print "From HTML::Entities lsquo => ",chr(8216),", rsquo => ", chr(8217);
```
Re^4: HTML::TokeParser, get_text scrambling rsquo and lsquo by tridral (Initiate) on May 12, 2007 at 10:02 UTC