Optimising Lingua::EN::NamedEntity for Very Long Strings

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm using some modules to parse my mbox files, but found that certain messages caused a CPU spike and made the process hang. I narrowed the problem down to Lingua::EN::NamedEntity, which one of the modules uses internally. It was choking on a message with a large attachment which, for the uninitiated, consists of many, many lines like

B/cCltBeBOMyzktNthjoXjIHOCsJvMkKk2u1Tcjlo6mAiwJmhwN6FT9iL...

(I'd been removing attachments before passing messages to Lingua::EN::NamedEntity, but this one was corrupted, so it remained inline.)

It strikes me that Lingua::EN::NamedEntity could be modified to better handle garbage input such as this, but I'm not sure of the best approach. Strings over a certain length just aren't useful for entity extraction, IMO. Any suggestions so I can send the maintainer a patch?
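
Until the module itself changes, one caller-side workaround is to strip over-long tokens before the text reaches the extractor. The following is only a minimal sketch: `strip_long_tokens` and its `$max_len` argument are hypothetical names, not part of Lingua::EN::NamedEntity, and it assumes that any unbroken run of non-whitespace longer than the cutoff is garbage. Base64 attachment lines are typically up to 76 characters with no spaces, so a cutoff somewhat below that would catch them.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::NamedEntity):
# delete every run of non-whitespace characters longer than $max_len,
# leaving the surrounding whitespace and line breaks untouched.
sub strip_long_tokens {
    my ($text, $max_len) = @_;
    my $too_long = $max_len + 1;          # shortest run that gets removed
    (my $clean = $text) =~ s/\S{$too_long,}//g;
    return $clean;
}
```

The cleaned string could then be handed to `extract_entities` in place of the raw message body, for example `extract_entities(strip_long_tokens($body, 40))`, where 40 is an arbitrary cutoff chosen for illustration.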

Re: Optimising Lingua::EN::NamedEntity for Very Long Strings
by EdwardG (Vicar) on May 10, 2006 at 14:19 UTC

Any suggestions so I can send the maintainer a patch?

Perhaps parameterisation, as in

      use Lingua::EN::NamedEntity;
      my @entities = extract_entities($some_text, $max_string_length);

or filter the output (though that would not solve your problem)

      my @entities = extract_entities($some_text, $max_entity_length);

A reasonable default for either option might be 92 characters, which would accommodate a variant spelling of the name of a hill in my country of origin:

Tetaumatawhakatangihangakoauaotamateaurehaeaturipukapihimaungahoronukupokaiwhenuaakitanarahu