Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am creating a Perl script to go through a text file and remove any identifying information (basically names and places in this first pass). The tack that I am going to take is to use some simple heuristics (look for Mr., Mrs., etc.) and then run each word through a dictionary (one without a proper-name list). Any word NOT found in the dictionary will get changed. My question: Is there a way to easily connect one of the open-source dictionaries (ispell) to Perl to do this? Or is there a better way? TIA!
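
A minimal sketch of the dictionary-lookup idea described above. It does not talk to ispell; it assumes the word list has already been loaded into a hash. The tiny inline list and the `redact_unknown` name are hypothetical, and in practice the hash would be filled from a word-list file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real word list. In practice this hash could be
# filled by reading a word-list file, one word per line.
my %dictionary = map { lc($_) => 1 }
    qw(the went to market on a sunny day and bought apples);

# Replace every alphabetic word that is NOT in the dictionary with
# a run of X's of the same length. Lookup is case-insensitive.
sub redact_unknown {
    my ($text) = @_;
    $text =~ s/\b([A-Za-z]+)\b/ exists $dictionary{lc($1)} ? $1 : 'X' x length($1) /ge;
    return $text;
}

print redact_unknown("Drewey went to Wawa on a sunny day"), "\n";
```

Words containing apostrophes or hyphens are split into pieces by this pattern, so a real script would need a more careful tokenizer.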

Replies are listed 'Best First'.
Re: Perl spell checker
by Albannach (Monsignor) on Oct 19, 2001 at 21:28 UTC
    What you propose to do sounds straightforward, but it really won't be that simple to get right, though you should be able to come up with something that makes the task much easier. I expect that whatever method or combination of methods you end up using, it is certain to miss things at least now and then, so a review by a human editor is still likely to be necessary to ensure both that
    1. You remove all identifying information. If you are taking steps to remove identification, chances are that failing to do so in some cases could lead to anything from embarrassment to legal action. I don't know what the source of your text is, but it could contain such things as
      • addresses (knowing where someone lives could be almost as identifying as their name, and this could be tricky, as "Mr. XXXXX of 123 Main St., Wawa ON" would be bad but "Mr. XXXXX of YYY YYYY St., Wawa ON" should be ok)
      • phone numbers
      • employee numbers
      • e-mail addresses
    2. You don't remove things that are necessary for the text to be useful, such as:
      • names of corporations, or other organizations (e.g. if the text were consumer complaints, how useful would it be to end up with "Mr. XXX XXXXX was killed by a faulty corkscrew made by YYYYYY Corp.")
      • names of public figures (e.g. the text might be, again depending on the source, something like "Mary Smith says Senator Jones is a baboon because..." in which you may want to hide Mary Smith but not the senator because (s)he is the subject of the text)
      • probably lots of other stuff
    Perhaps your text is from just one source you know to have a strict content standard, and that may make things much easier, but it would be worthwhile to consider what could go wrong.
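
Some of the non-name identifiers listed above have a recognizable shape and can be caught with patterns instead of a dictionary. The sketch below masks e-mail addresses and North American style phone numbers. Both patterns are loose assumptions rather than validators, and `redact_contacts` is a hypothetical name. Employee numbers are left out because their format depends on the source.

```perl
use strict;
use warnings;

# Mask e-mail addresses and phone numbers. The patterns are deliberately
# simple assumptions and will miss unusual formats.
sub redact_contacts {
    my ($text) = @_;

    # Anything shaped like user@host.tld
    $text =~ s/\b[\w.+-]+\@[\w-]+(?:\.[\w-]+)+\b/[EMAIL]/g;

    # e.g. 705-555-0123, (705) 555-0123 or 705.555.0123
    $text =~ s/\(?\b\d{3}\)?[-.\s]\d{3}[-.]\d{4}\b/[PHONE]/g;

    return $text;
}

print redact_contacts('Call Mary at (705) 555-0123 or mary.smith@example.com.'), "\n";
```

Street addresses are harder, since only the house number and street name should go; that case still needs the human review mentioned above.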

    --
    I'd like to be able to assign to an luser

Re: Perl spell checker
by Hofmator (Curate) on Oct 19, 2001 at 20:01 UTC
Re: Perl spell checker
by Dr. Mu (Hermit) on Oct 19, 2001 at 21:01 UTC
    Lingua::EN::NameParse may also be able to locate names in your text. This isn't what it was designed to do, exactly, but I'd bet it could be pressed into service for your app.
Re: Perl spell checker
by thunders (Priest) on Oct 19, 2001 at 21:45 UTC
    UPDATE: added some code, i apologize in advance for the sorry-@$$ regexes within, I'm still learning

    I suggest splitting the text file up by sentences and x-ing out anything that is first letter capitalized and not at the start of a sentence. But then again, I don't know what kind of text file you are looking at. If there are any more specific requirements post them.
    this is an extremely simplified example of what I mean:

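    # Sample text containing names, a company and a place.
    my $paragraph = "This is a group of words. It mentions people like Joe Smith and Jill Doe who work at Aerodyne Laboratories, INC. The facility is located in Springfield, MA and is famous for it's llamas, lizards and Gork the giant robot.";

    # Split into sentences on a period followed by optional whitespace.
    my @text = split(/\.\s*/, $paragraph);

    foreach my $line (@text) {
        $line = lcfirst($line);            # hide the sentence-initial capital
        $line =~ s/[A-Z][a-z]+/XXXXX/g;    # x out the remaining capitalized words
        $line = ucfirst($line);            # restore the sentence-initial capital
    }

    # split() consumed the periods, so put them back, including the last one.
    print join(". ", @text), ".\n";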

    produces:
    This is a group of words. It mentions people like XXXXX XXXXX and XXXXX XXXXX who work at XXXXX XXXXX, INC. The facility is located in XXXXX, MA and is famous for it's llamas, lizards and XXXXX the giant robot.
Re: Perl spell checker
by John M. Dlugosz (Monsignor) on Oct 20, 2001 at 02:13 UTC
    When I wrote Random Words, I downloaded the dictionary from Project Gutenberg, and processed the entries out of that with a Perl script.

    —John

Re: Perl spell checker
by Anonymous Monk on Oct 19, 2001 at 20:02 UTC
    Sorry...clarification. I will check for Mr. Blah Blah and replace it with Mr. XXXX XXXXX. Then I will run it through a spell check and replace any words not in the dictionary....such as Jane Drewey would get XXXXX XXXXXXX. Hope this makes more sense.
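
The title-based first pass described above could be sketched as follows. The list of titles and the limit of three name words are illustrative assumptions, and `redact_titled_names` is a hypothetical name. Each letter is replaced by an X, so the mask keeps the length of the original name.

```perl
use strict;
use warnings;

# Mask one to three capitalized words that follow a title such as Mr. or Mrs.
# The title list is an assumption; extend it to suit the text.
sub redact_titled_names {
    my ($text) = @_;
    $text =~ s{\b((?:Mr|Mrs|Ms|Dr)\.?)((?:\s+[A-Z][a-z]+){1,3})}{
        my ($title, $names) = ($1, $2);
        $names =~ tr/A-Za-z/X/;    # hide the letters, keep the spacing
        $title . $names;
    }ge;
    return $text;
}

print redact_titled_names("Mr. Blah Blah called Mrs. Jane Drewey yesterday."), "\n";
```

A capitalized word that merely happens to follow the name would be masked as well, so the result still needs checking.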

      I hope you don't have anything talking about people named `Smith', or `Wheeler', or even `Fletcher'. You probably want to see if it's not in the dictionary and/or if it's capitalized in the middle of the sentence. Also you'd have to watch carefully for acronyms that might not be all caps (e.g. `tcp socket', `nfs mount').
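
A rough sketch of that combined test: a word is masked if it is missing from the dictionary or if it is capitalized anywhere other than the start of the sentence, and a small allow-list keeps lowercase acronyms. Both word lists and the `redact_sentence` name are hypothetical, and the function expects one sentence at a time.

```perl
use strict;
use warnings;

# Hypothetical tiny word lists, for illustration only.
my %dictionary = map { $_ => 1 } qw(the is a was fixed by who mount socket smith on);
my %acronym    = map { $_ => 1 } qw(tcp nfs);

sub redact_sentence {
    my ($sentence) = @_;
    my $first = 1;    # the first word is capitalized anyway, so only the lookup applies to it
    $sentence =~ s{\b([A-Za-z]+)\b}{
        my $word     = $1;
        my $known    = $dictionary{lc($word)} || $acronym{lc($word)};
        my $mid_caps = !$first && $word =~ /^[A-Z][a-z]/;
        $first = 0;
        (!$known || $mid_caps) ? 'X' x length($word) : $word;
    }ge;
    return $sentence;
}

print redact_sentence("The nfs mount was fixed by Smith"), "\n";
```

Here `Smith` is masked even though `smith` is a dictionary word, because it is capitalized in mid-sentence, while `nfs` survives through the acronym list. A surname that opens a sentence would still slip through.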