Well, I've narrowed it down to langof() in Lingua::Identify, called from Text::Compare.
If I set the lang to 'en' in Text::Compare's get_words() subroutine, it runs at a steady ~24MB on Linux. Of course, the metric is now different, since it's no longer "correctly" (as if it ever really was) identifying the languages, but for my app that's not really a problem, since I don't care. As long as it doesn't skew the results badly, and since this isn't the only metric I'm using, it's not looking that bad.
Its main effect is to make the matches I get back indicate more similarity than before. And, actually, I'd rather have more false positives, so it doesn't hurt too much.
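For anyone who wants to check the memory behaviour themselves, here is a rough sketch of the kind of measurement I mean. It is Linux-only (it reads VmRSS from /proc/self/status), the helper names rss_kb() and rss_growth() are just made up for this post, and the sample text and iteration count are arbitrary. If Lingua::Identify isn't installed it falls back to a trivial stand-in, so treat it as a starting point rather than a proper benchmark.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Resident set size of this process in kB, read from /proc (Linux only).
# Returns undef if the file or the VmRSS line is not available.
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while (my $line = <$fh>) {
        return $1 if $line =~ /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

# Call $code $iterations times and return the RSS growth in kB
# (hypothetical helper; returns undef if RSS could not be read).
sub rss_growth {
    my ($code, $iterations) = @_;
    my $before = rss_kb();
    $code->() for 1 .. $iterations;
    my $after = rss_kb();
    return unless defined $before && defined $after;
    return $after - $before;
}

# Arbitrary sample text, an assumption for illustration only.
my $text = 'The quick brown fox jumps over the lazy dog. ' x 50;

# Probe langof() when Lingua::Identify is installed; otherwise use a
# harmless stand-in so the script still runs.
my $probe = eval { require Lingua::Identify; 1 }
    ? sub { my $lang = Lingua::Identify::langof($text) }
    : sub { my $len = length $text };

my $growth = rss_growth($probe, 1_000);
print defined $growth ? "RSS growth: $growth kB\n" : "VmRSS not available\n";
```

Swapping the probe for a call into Text::Compare, with and without the hard-coded 'en', should show whether the growth follows langof().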
I think I'll talk to the Lingua people next.
Thanks for everyone's suggestions.
Kind regards
Derek.
Dear Sam,
Thanks - that's useful to know. I wasn't sure whether that was true with ithreads or not. I'll test that. Any idea on the percentage?
Kind regards
Derek.