Re: ContentExtractor Coding

it leaves out any character with accents, etc.

A quick look at the module's source reveals that there is a line (TokeParserTokenizer.pm:61)

    #Remove HTML directives
    $text =~ s/\Q&\E.*?\Q;\E/ /g;

which replaces any HTML character entity references like &amp;auml; with a space (the extra quoting of '&' and ';' in that regex is superfluous, btw, but that's another issue). I.e., it would turn content like "&amp;auml;b&amp;ccedil;d&amp;egrave;foo" into " b d foo".

You could try commenting out that line. Or maybe replace it with

    decode_entities($text);

and add use HTML::Entities; near the top of that package. That should turn the above content string into "äbçdèfoo".
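
If you want to see the difference outside the module first, here is a minimal standalone sketch. The variable names ($html, $stripped, $decoded) and the sample string are just for illustration and don't come from the module, and it assumes HTML::Entities (part of the HTML-Parser distribution on CPAN) is installed:

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Sample input containing three character entity references
# (illustrative only, not taken from the module).
my $html = '&auml;b&ccedil;d&egrave;foo';

# What the module's line does: every "&...;" run becomes a space.
# The \Q...\E quoting is dropped here because '&' and ';' are not
# regex metacharacters.
(my $stripped = $html) =~ s/&.*?;/ /g;

# The suggested alternative: decode the entities into characters.
# Called in non-void context, decode_entities returns the decoded
# string and leaves $html untouched.
my $decoded = decode_entities($html);

# Print as UTF-8 so the accented characters display properly.
binmode STDOUT, ':encoding(UTF-8)';
print "stripped: '$stripped'\n";
print "decoded:  '$decoded'\n";
```

Note that inside the module the call is made in void context, i.e. decode_entities($text); with no assignment, which modifies $text in place. Whether the rest of the tokenizer copes with the resulting non-ASCII characters is something you'd have to test.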
