Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
I'm using some modules to parse my mbox files, but found that certain messages made the CPU spike and the process hang. I narrowed the problem down to Lingua::EN::NamedEntity, which one of the modules uses internally. It was choking on a message with a large attachment, which, for the uninitiated, consists of many, many lines like:
B/cCltBeBOMyzktNthjoXjIHOCsJvMkKk2u1Tcjlo6mAiwJmhwN6FT9iL...
(I'd been removing the attachments before passing them to Lingua::EN::NamedEntity, but that one was corrupted, so remained inline).
It strikes me that Lingua::EN::NamedEntity could be modified to better handle garbage input such as this, but I'm not sure of the best approach. Strings over a certain length just aren't useful for entity extraction, IMO. Any suggestions so I can send the maintainer a patch?
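For what it's worth, here is a rough sketch of the sort of length filter I have in mind, written as a stand-alone pre-filter rather than a patch. The name strip_long_tokens and the 40-character cutoff are my own assumptions; neither comes from Lingua::EN::NamedEntity, and the right threshold would need testing against real mail.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Lingua::EN::NamedEntity:
# remove every run of non-whitespace characters longer than $max_len.
sub strip_long_tokens {
    my ($text, $max_len) = @_;
    $max_len = 40 unless defined $max_len;   # arbitrary cutoff (assumption)
    my $limit = $max_len + 1;                # shortest run that gets removed
    $text =~ s/\S{$limit,}//g;
    return $text;
}

# Made-up sample: one line of prose plus one 76-character base64-like line.
my $sample = "Met Bob Smith in Paris.\n" . ('QUJD' x 19) . "\n";
my $clean  = strip_long_tokens($sample);
print $clean;
```

The cleaned text is what I would then hand to the module in place of the raw message body. This only keeps the garbage away from the extractor; it does nothing about whatever makes the module slow on such input in the first place.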
Replies are listed 'Best First'.
Re: Optimising Lingua::EN::NamedEntity for Very Strings
by EdwardG (Vicar) on May 10, 2006 at 14:19 UTC