Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm using some modules to parse my mbox files, but found that certain messages caused a CPU spike and hung the process. I narrowed the problem down to Lingua::EN::NamedEntity, which one of the modules uses internally. It was choking on a message with a large attachment, which, for the uninitiated, consists of many, many lines like

B/cCltBeBOMyzktNthjoXjIHOCsJvMkKk2u1Tcjlo6mAiwJmhwN6FT9iL...

(I'd been removing the attachments before passing them to Lingua::EN::NamedEntity, but that one was corrupted, so remained inline).

It strikes me that Lingua::EN::NamedEntity could be modified to better handle garbage input such as this, but I'm not sure of the best approach. Strings over a certain length just aren't useful for entity extraction, IMO. Any suggestions so I can send the maintainer a patch?

Replies are listed 'Best First'.
Re: Optimising Lingua::EN::NamedEntity for Very Strings
by EdwardG (Vicar) on May 10, 2006 at 14:19 UTC

           Any suggestions so I can send the maintainer a patch?

    Perhaps parameterisation, as in

    use Lingua::EN::NamedEntity;
    my @entities = extract_entities($some_text, $max_string_length);

    or filter the output (though that would not solve your problem):

    my @entities = extract_entities($some_text, $max_entity_length);

    A reasonable default for either option might be 92 characters, which would accommodate a variant spelling of the name of a hill in my country of origin:

    Tetaumatawhakatangihangakoauaotamateaurehaeaturipukapihimaungahoronukupokaiwhenuaakitanarahu (link goes to image).
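
Until the module itself takes such a parameter, the same idea can be tried as a pre-filter in the calling code. This is only a minimal sketch: strip_long_tokens is a hypothetical helper, not part of Lingua::EN::NamedEntity, the sample text and junk string are made up, and the 92-character default is simply the figure suggested above.

```perl
use strict;
use warnings;

# Hypothetical helper (NOT part of Lingua::EN::NamedEntity): drop every
# whitespace-delimited token longer than $max_len characters, so that
# attachment debris never reaches the entity extractor.
# Side effect: all runs of whitespace, including newlines, collapse to
# a single space.
sub strip_long_tokens {
    my ($text, $max_len) = @_;
    $max_len = 92 unless defined $max_len;    # default suggested above
    return join ' ', grep { length($_) <= $max_len } split ' ', $text;
}

# Made-up sample: ordinary prose around one 200-character junk line.
my $junk  = 'QUJD' x 50;
my $text  = "Alice met Bob in Wellington.\n$junk\nThey left on Tuesday.";
my $clean = strip_long_tokens($text);

print "$clean\n";    # prints: Alice met Bob in Wellington. They left on Tuesday.

# The cleaned text could then go to the existing one-argument call:
#   use Lingua::EN::NamedEntity;
#   my @entities = extract_entities($clean);
```

One caveat on the threshold: MIME base64 is normally wrapped at 76 characters per line, so a 92-character limit on its own may let well-formed attachment lines through. A lower limit, or an extra check for lines made up solely of base64 characters, might be needed for the case in the question.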