What you propose to do sounds straightforward, but it really
won't be that simple to get right, though you should be
able to come up with something to make the task much easier.
I expect that whatever method or combination of methods you
end up using, it is certain to miss things at least now and then,
so a review by a human editor is still likely to be necessary
in order to ensure both that
- You remove all identifying information. If you are taking steps
to remove identification, chances are that failing to do so in some
cases could lead to anything from embarrassment to legal action. I don't know what
the source of your text is, but it could contain such things as
  - addresses (knowing where someone lives can be almost as identifying as their name, and this
    could be tricky, as "Mr. XXXXX of 123 Main St., Wawa ON" would be bad but "Mr. XXXXX of YYY YYYY St., Wawa ON"
    should be ok)
  - phone numbers
  - employee numbers
  - e-mail addresses
- You don't remove things that are necessary for the text to be useful, such as:
  - names of corporations or other organizations (e.g. if the text were consumer complaints,
    how useful would it be to end up with "Mr. XXX XXXXX was killed by a faulty corkscrew made by YYYYYY Corp."?)
  - names of public figures (e.g. the text might be, again depending on the source, something
    like "Mary Smith says Senator Jones is a baboon because..." in which you may want to
    hide Mary Smith but not the senator because (s)he is the subject of the text)
  - probably lots of other stuff
Perhaps your text is from just one source you know to have
a strict content standard, and that may make things much
easier, but it would be worthwhile to consider what could go wrong.
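
To make the above a bit more concrete, here is a rough Python sketch of
the kind of first pass I have in mind. Everything in it is an assumption
on my part: the patterns, the KEEP allow-list and the redact() function
are hypothetical names made up for illustration, and the patterns are
deliberately simple, so they will miss plenty (a bare "Mary Smith" with
no title, for one). It masks what it recognizes, leaves allow-listed
names alone, and returns notes for the human editor to check.

```python
import re

# Hypothetical, deliberately simple patterns; they will miss real cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")
# A title followed by one capitalized word, e.g. "Mr. Brown".
TITLED_NAME_RE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr|Senator)\.? [A-Z][a-z]+")

# Hypothetical allow-list of names that must stay readable.
KEEP = {"Senator Jones"}


def redact(text):
    """Return (redacted_text, notes); the notes are for a human reviewer."""
    notes = []

    def masker(kind):
        def _sub(match):
            found = match.group(0)
            if found in KEEP:
                notes.append(f"kept {kind}: {found}")
                return found
            notes.append(f"masked {kind}: {found}")
            return f"[{kind}]"
        return _sub

    # E-mail addresses and phone numbers first, then titled names.
    text = EMAIL_RE.sub(masker("email"), text)
    text = PHONE_RE.sub(masker("phone"), text)
    text = TITLED_NAME_RE.sub(masker("name"), text)
    return text, notes


if __name__ == "__main__":
    sample = ("Mr. Brown (brown@example.com, 705-555-0123) "
              "says Senator Jones is wrong.")
    redacted, notes = redact(sample)
    print(redacted)
    for note in notes:
        print(" ", note)
```

The notes list is the important part: it gives the reviewer a record of
what was masked and what was kept on purpose, but it says nothing about
what the patterns never saw, so the text itself still needs reading.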
--
I'd like to be able to assign to an luser