fanticla has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I need to extract text from HTML pages. I am using HTML::Content::ContentExtractor. I'm quite happy with it, as it tries to extract only the body text, leaving out boilerplate, tags, scripts, etc.

Now the BIG problem: encoding. As soon as I try to convert a non-English site, it leaves out every accented character and the like.

The basic script (run for every HTML file I have in my $save_grabbing folder) is:

my $tokenizer = new HTML::Content::HTMLTokenizer('TAG','WORD');
my $ranker = new HTML::WordTagRatio::WeightedRatio();
my $extractor = new HTML::Content::ContentExtractor(
    $tokenizer, $ranker,
    "$save_grabbing/$html_counter_1.html",
    "$save_grabbing/$html_counter_1.txt",
);
$extractor->Extract();

I think this problem is not easy to solve, but maybe you can give me some suggestions.

Thank you

Cla

Replies are listed 'Best First'.
Re: ContentExtractor Coding
by Your Mother (Archbishop) on Jun 01, 2010 at 16:47 UTC
    I think this problem is not easy to solve, but maybe you can give me some suggestions.

    I think anything like this, which is obviously useful and something someone else must have done before, is almost certainly easy in Perl. You just have to pick the right tool, and anything that dumps entities is definitely the wrong tool.

    Try this-

    use warnings;
    use strict;
    use WWW::Mechanize;
    use Encode;

    my $mech = WWW::Mechanize->new(agent => "NotSoForbiddenBot/0.99");
    $mech->get("http://en.wikipedia.org/wiki/German_language");

    # You might have to edit/detect the encode statement to
    # match the document's.
    print encode("UTF-8", $mech->content(format => "text"));

    That might be losing too much formatting and spacing for you. You can adapt this recipe on the raw HTML instead: Re: Strip HTML tags again. It will probably preserve white-space (outside of tables anyway) better.

Re: ContentExtractor Coding
by almut (Canon) on Jun 01, 2010 at 16:24 UTC
    it leaves out any character with accents, etc.

    A quick look at the module's source reveals that there is a line (TokeParserTokenizer.pm:61)

    #Remove HTML directives
    $text =~ s/\Q&\E.*?\Q;\E/ /g;

    which replaces any HTML character entity reference like &auml; with a space (the extra quoting of '&' and ';' in that regex is superfluous, btw, but that's another issue). I.e., it would turn content like "&auml;b&ccedil;d&egrave;foo" into " b d foo".
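
    To see the effect in isolation, here is a minimal sketch using only core Perl. The substitution is the one quoted above; the sub names and the sample string are made up for illustration and are not part of the module.

```perl
use strict;
use warnings;

# The substitution as quoted from TokeParserTokenizer.pm.
# (strip_entities is a hypothetical wrapper, not a module function.)
sub strip_entities {
    my ($text) = @_;
    $text =~ s/\Q&\E.*?\Q;\E/ /g;
    return $text;
}

# The same pattern without the superfluous \Q...\E quoting.
sub strip_entities_plain {
    my ($text) = @_;
    $text =~ s/&.*?;/ /g;
    return $text;
}

my $sample = "&auml;b&ccedil;d&egrave;foo";

# Brackets make the leading space visible.
print "[", strip_entities($sample), "]\n";        # prints "[ b d foo]"
print "[", strip_entities_plain($sample), "]\n";  # prints "[ b d foo]"
```

    Both variants give the same result, so the accented characters are lost before any encoding question even comes up.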

    You could try commenting out that line.  Or maybe replace it with

    decode_entities($text);

    and add use HTML::Entities; near the top of that package.  That should turn the above content string into "äbçdèfoo".
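
    Outside the module, the suggested replacement can be tried with a sketch like the one below. It assumes HTML::Entities (part of the HTML-Parser distribution) is installed; decode_sample is a hypothetical helper name, and encoding the output as UTF-8 is an assumption about where the text ends up.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);
use Encode qw(encode);

# decode_sample is a hypothetical helper for illustration only.
sub decode_sample {
    my ($text) = @_;
    # Called in void context, decode_entities() modifies $text in place,
    # which is how the suggested one-line replacement would behave.
    decode_entities($text);
    return $text;
}

my $decoded = decode_sample("&auml;b&ccedil;d&egrave;foo");

# $decoded is now a character string; encode it before printing.
# UTF-8 is an assumption here, adjust to whatever the output needs.
print encode("UTF-8", "$decoded\n");
```

    The decoded string holds Unicode characters rather than bytes, so whatever writes the .txt file may still need an explicit output encoding.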