in reply to ContentExtractor Coding

it leaves out any character with accents, etc.

A quick look at the module's source reveals that there is a line (TokeParserTokenizer.pm:61)

# Remove HTML directives
$text =~ s/\Q&\E.*?\Q;\E/ /g;

which replaces any HTML character entity reference like &auml; with a space (the extra quoting of '&' and ';' in that regex is superfluous, btw, but that's another issue).  I.e., it would turn content like "&auml;b&ccedil;d&egrave;foo" into " b d foo".

You could try commenting out that line.  Or maybe replace it with

decode_entities($text);

and add use HTML::Entities; near the top of that package.  That should turn the above content string into "äbçdèfoo".
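
To see the difference outside the module, here is a minimal standalone sketch. The variable names and the sample string are just for illustration (they are not taken from TokeParserTokenizer.pm), and it assumes HTML::Entities is installed, which it should be wherever HTML::Parser is:

```perl
use strict;
use warnings;
use utf8;
use HTML::Entities qw(decode_entities);

# Hypothetical sample input, standing in for text the tokenizer would see.
my $html_text = '&auml;b&ccedil;d&egrave;foo';

# What the existing line does: every "&...;" run becomes a single space.
# (Same regex as in the module, minus the superfluous \Q...\E quoting.)
(my $stripped = $html_text) =~ s/&.*?;/ /g;

# Suggested replacement: decode the entities instead of discarding them.
# Called in void context, decode_entities() modifies its argument in place.
my $decoded = $html_text;
decode_entities($decoded);

binmode STDOUT, ':encoding(UTF-8)';
print "stripped: '$stripped'\n";
print "decoded:  '$decoded'\n";
```

Note that the decoded string contains real Unicode characters, so whatever prints or stores the extracted text afterwards needs a suitable output encoding (hence the binmode here).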