in reply to Re: HTML::TokeParser, get_text scrambling rsquo and lsquo
in thread HTML::TokeParser, get_text scrambling rsquo and lsquo

I think it's a good point about the desired output encoding. I'm only reading the html to produce another html file, so it would suit me just as well to read the text raw without interpreting the html codes. Is there something similar to get_text that just delivers the text without interpreting it first? I've not found anything like that when reading about TokeParser.

It may be that the problem can be solved by looking at character encodings as people have suggested, but in case that falls through, I'd also like to look at the possibility of reading uninterpreted html.

Thanks for your help

Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo
by Joost (Canon) on May 12, 2007 at 10:57 UTC
    I don't see any method to get the "raw" text either.

    In any case, if the output encoding doesn't matter, just open the output file in utf8 mode and set the correct encoding in the html file (not needed if the file is on a webserver that sends the correct content-type w/ charset header for the file):

        open my $out, ">:utf8", $filename or die $!;
        # print head start
        print $out q(<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">);
        # print rest of head and document
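
    A minimal end-to-end sketch of that approach, assuming the input is available as a string and that HTML::TokeParser is installed. The sample markup and the out.html path are made up for illustration. get_text decodes &rsquo; and &lsquo; into Unicode characters, and the encoding layer on the output handle turns them into valid UTF-8 bytes:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical input; the entities are plain ASCII in the source.
my $html = '<p>It&rsquo;s &lsquo;quoted&rsquo;</p>';

my $p = HTML::TokeParser->new(\$html) or die "can't parse: $!";
$p->get_tag('p');
my $text = $p->get_text('/p');    # entities are decoded to Unicode characters

# Encode on output so U+2018/U+2019 are written as valid UTF-8 bytes.
my $filename = 'out.html';        # hypothetical output path
open my $out, '>:encoding(UTF-8)', $filename
    or die "can't open $filename: $!";
print {$out} qq(<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n);
print {$out} "<p>$text</p>\n";
close $out or die "can't close $filename: $!";
```

    Note that the decoded text is written back without re-escaping, so this only suits text that contains no literal < or & characters after decoding.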
Re^3: HTML::TokeParser, get_text scrambling rsquo and lsquo
by wfsp (Abbot) on May 13, 2007 at 10:13 UTC
    reading... html to produce... html
    I prefer to avoid jiggering about with the encoding too. In my experience it always ends in tears. :-)

    Here's what I do (I'm using HTML::TokeParser::Simple in this case).

        #!/usr/bin/perl
        use strict;
        use warnings;
        use lib 'lib';
        use MyParser;

        my ($p, $txt);
        my $html = do { local $/; <DATA> };

        # parse the markup held in the __DATA__ section
        $p = MyParser->new(string => $html) or die "can't parse: $!\n";

        $txt = $p->get_title;
        print "$txt\n";

        $p->get_tag('p');
        $txt = $p->get_txt('p');    # up to a closing p tag
        print "*$txt*\n";

        __DATA__
        <html>
        <head>
        <title>egrave: &egrave; : eacute : &eacute; : rsquo: &rsquo; : lsquo: &lsquo;</title>
        </head>
        <body>
        <p>
        one <span class="second">two</span> three <br>
        four five six
        </p>
        </body>
        </html>
    MyParser.pm
        package MyParser;
        use strict;
        use warnings;
        use HTML::TokeParser::Simple;
        use base qw(HTML::TokeParser::Simple);

        # Return the text of the <title> element, or nothing if there is none.
        sub get_title {
            my ($self) = @_;
            $self->get_tag('title') or return;
            $self->get_txt('title');
        }

        # Collect the text up to the closing $tag exactly as written in the
        # source (as_is leaves entities undecoded), with whitespace collapsed.
        sub get_txt {
            my ($self, $tag) = @_;
            my $txt = '';
            while (my $t = $self->get_token) {
                last if $t->is_end_tag($tag);
                next if $t->is_start_tag or $t->is_end_tag;
                $txt .= $t->as_is if $t->is_text;
            }
            for ($txt) {
                s/\n/ /g;
                s/^\s+//;
                s/\s+$//;
                s/\s+/ /g;
            }
            return $txt;
        }

        1;
    If, like me, you frequently need the title, you can add a method to the wrapper for it, as I've done here. I've also tried to emulate HTML::TokeParser's get_trimmed_text.

    get_txt skips any tags found before the required end tag (although this won't apply to titles).
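
    For comparison, roughly the same thing seems possible with plain HTML::TokeParser, since get_token hands back text tokens as ["T", $text, $is_data] without decoding entities (the decoding happens in get_text). This is only a sketch: raw_text_until is a made-up helper name and the sample markup is invented for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the text up to </$tag> without decoding entities.
sub raw_text_until {
    my ($p, $tag) = @_;
    my $txt = '';
    while (my $t = $p->get_token) {
        last if $t->[0] eq 'E' && $t->[1] eq $tag;    # stop at the closing tag
        $txt .= $t->[1] if $t->[0] eq 'T';            # raw text, entities untouched
    }
    # Collapse whitespace, much as get_trimmed_text would.
    for ($txt) {
        s/\s+/ /g;
        s/^\s+//;
        s/\s+$//;
    }
    return $txt;
}

my $html = '<p>It&rsquo;s <b>raw</b> &lsquo;text&rsquo;</p>';
my $p = HTML::TokeParser->new(\$html) or die "can't parse: $!";
$p->get_tag('p');
print raw_text_until($p, 'p'), "\n";
```

    Other tags inside the element are simply skipped, so only the text between them is kept.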

    I'd be interested in hearing comments from monks if any glaring faux pas have been committed. :-)

      Thanks to everyone for their suggestions. Unfortunately I've not had time to go back over my code in order to take advantage of the monastic wisdom but I thought that I should at least stop here a while to offer thanks.