fanticla has asked for the wisdom of the Perl Monks concerning the following question:
Hi Monks,
I need to extract text from HTML pages. I am using HTML::Content::ContentExtractor and am quite happy with it, as it tries to extract only the body text, leaving out boilerplate, tags, scripts, etc.
Now the BIG problem: encoding. As soon as I try to convert a non-English site, it drops every accented character and the like.
The basic script (run for every HTML page I have in my $save_grabbing folder) is:
my $tokenizer = new HTML::Content::HTMLTokenizer('TAG','WORD');
my $ranker = new HTML::WordTagRatio::WeightedRatio();
my $extractor = new HTML::Content::ContentExtractor(
    $tokenizer,
    $ranker,
    "$save_grabbing/$html_counter_1.html",
    "$save_grabbing/$html_counter_1.txt",
);
$extractor->Extract();
I think this problem is not easy to solve, but maybe you can give me some suggestions.
Thank you
Cla
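One thing that may be worth ruling out first is the encoding of the saved pages themselves. The sketch below is only an assumption-laden starting point, not a known fix: it re-encodes a saved page to UTF-8 with the core Encode module before the page is handed to the extractor. The helper name recode_to_utf8 is hypothetical, the source charset has to be supplied by the caller (for example taken from the page's HTTP header or meta tag), and whether HTML::Content::ContentExtractor then keeps the accented characters is not confirmed here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: read $in_path as raw bytes, interpret them as
# $from_charset, and write the same text to $out_path encoded as UTF-8.
# Returns the number of characters (not bytes) that were converted.
sub recode_to_utf8 {
    my ($in_path, $out_path, $from_charset) = @_;

    # Read the whole file as bytes, with no implicit decoding.
    open my $in, '<:raw', $in_path or die "Cannot read $in_path: $!";
    my $bytes = do { local $/; <$in> };
    close $in;

    # Turn bytes into Perl characters; FB_CROAK dies on malformed input
    # instead of silently substituting or dropping characters.
    my $text = decode($from_charset, $bytes, Encode::FB_CROAK());

    # Write the characters back out as UTF-8 bytes.
    open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
    print {$out} encode('UTF-8', $text);
    close $out or die "Cannot close $out_path: $!";

    return length $text;
}
```

It could be called once per saved page, with the charset that page was actually served in, and the recoded copy passed to the extractor in place of the original. If the accents still disappear after that, the loss is more likely happening inside the tokenizer than in the input files.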
Re: ContentExtractor Coding
by Your Mother (Archbishop) on Jun 01, 2010 at 16:47 UTC

Re: ContentExtractor Coding
by almut (Canon) on Jun 01, 2010 at 16:24 UTC