in reply to ContentExtractor Coding
it leaves out any character with accents, etc.
A quick look at the module's source reveals that there is a line (TokeParserTokenizer.pm:61)
#Remove HTML directives
$text =~ s/\Q&\E.*?\Q;\E/ /g;
which replaces any HTML character entity reference, such as &auml;, with a space (the extra quoting of '&' and ';' in that regex is superfluous, btw, but that's another issue). I.e., it would turn content like "&auml;b&ccedil;d&egrave;foo" into " b d foo".
You could try commenting out that line. Or maybe replace it with
decode_entities($text);
and add use HTML::Entities; near the top of that package. That should turn the above content string into "äbçdèfoo".
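To see the difference outside the module, here is a small standalone sketch. It assumes HTML::Entities is installed; the variable names ($content, $stripped, $decoded) are made up for illustration and are not taken from TokeParserTokenizer.pm. The first substitution is the same regex without the superfluous \Q...\E quoting.

```perl
use strict;
use warnings;
use HTML::Entities;    # provides decode_entities()

# Sample input containing character entity references (illustrative only).
my $content = '&auml;b&ccedil;d&egrave;foo';

# What the module's line does: replace every "&...;" sequence with a space.
(my $stripped = $content) =~ s/&.*?;/ /g;

# The suggested replacement: decode the entities instead of discarding them.
# In void context decode_entities() modifies its argument in place.
my $decoded = $content;
decode_entities($decoded);

binmode STDOUT, ':encoding(UTF-8)';
print "stripped: '$stripped'\n";    # accented characters are lost
print "decoded:  '$decoded'\n";     # accented characters are preserved
```

Note that the decoded string holds Unicode characters, so whatever prints or stores it afterwards needs an appropriate output encoding.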