Re: Efficient selective substitution on list of words

Deepest apologies for having skipped over the part of the OP that BrowserUk has considerately placed into focus for me.

Now that I understand it correctly, I'll try again in a separate reply.

Pardon me if I'm jumping to conclusions, but it seems like your notion of "stopwords" is really just a matter of making sure that the "word" string is not part of a larger word. If that's really all it amounts to, all you need is to put \b assertions around the alternation of words:

my %edits = (
    score => 'twenty',
    core => 'center',
    centre => 'center',
    centres => 'centers',
    travelled => 'traveled',
    "hasn't" => 'has not',
    Johann => 'John' );

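# quotemeta keeps any regex metacharacters in the keys from being
# interpreted; qr// compiles the alternation once, outside the loop
my $alternation = join( '|', map { quotemeta } keys %edits );
my $pattern = qr/\b($alternation)\b/;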

while (<DATA>) {
    s/$pattern/$edits{$1}/g;
    print;
}

__DATA__
fourscore and score years ago, we scored great scores with
apple cores.  it's time for an encore at the core of our cultural
centre.  in many centres where we travelled, Johann hasn't
scored as well as he did in Johannesburg, where his score
against Johannes Brahms shook us to our cores.

The example data there points out a few issues you may need to cope with when using this approach:

- spelling changes (e.g. "centre" to "center") will need to be specified for all inflected/derived forms ("centres", "centred", "centring") due to the use of the \b assertions
- some replacements will be inappropriate due to ambiguous usage (e.g. "score" may be used in a context where it does not mean "twenty")
- some replacements might produce awkward results (e.g. "core centre" becomes "center center") -- maybe that's a stretch, but it's relevant to the example that you provided.
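
To make the first of those points concrete, here is a minimal sketch that lists only the base form "centre" on purpose. The sample sentence is made up for illustration; the pattern is built the same way as in the snippet above.

```perl
use strict;
use warnings;

# Only the base form is listed here, on purpose.
my %edits = ( centre => 'center' );

my $alternation = join '|', map { quotemeta } keys %edits;
my $pattern     = qr/\b($alternation)\b/;

# Made-up sample text containing the base form and two derived forms.
my $text = "the centre has two centres, both centred on the square";

# Copy the text, then edit the copy so the original stays available.
( my $edited = $text ) =~ s/$pattern/$edits{$1}/g;

print "$edited\n";
# prints: the center has two centres, both centred on the square
```

The trailing \b fails between "centre" and the following "s" or "d", so the derived forms pass through unchanged unless they get their own entries in %edits.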

But depending on the actual set of replacements you need to do, those issues are likely to be less bothersome than the problem of trying to figure out all the "stopwords" you would need to specify in order to avoid incorrect replacements within larger words.

In any case, the exercise as a whole really should be "previewed" or "monitored": for a given set of replacements and input data, get a listing of all the matches in the data, and/or review all changes applied by the process, to confirm that all changes are as intended. If you really are dealing with "natural language" data here, it pays to be really careful.
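
One way to do that kind of preview, as a rough sketch: run the same pattern in match-only mode and report every hit before touching the data. The helper name preview_edits and the report layout are my own invention, and the %edits table is cut down from the one above.

```perl
use strict;
use warnings;

# A cut-down version of the %edits table, for illustration only.
my %edits = (
    score  => 'twenty',
    core   => 'center',
    centre => 'center',
);

my $alternation = join '|', map { quotemeta } sort keys %edits;
my $pattern     = qr/\b($alternation)\b/;

# preview_edits() is a hypothetical helper: it changes nothing, and
# returns one report line per match so the changes can be reviewed.
sub preview_edits {
    my @lines = @_;
    my @report;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        while ( $line =~ /$pattern/g ) {
            my $col = $-[1] + 1;    # 1-based column where the match starts
            push @report, "line $line_no, col $col: '$1' -> '$edits{$1}'";
        }
    }
    return @report;
}

my @sample = (
    "fourscore and score years ago, we scored great scores with",
    "apple cores.  it's time for an encore at the core of our cultural",
    "centre.",
);

print "$_\n" for preview_edits(@sample);
```

On real data you would probably want to print some surrounding context with each hit as well, so that cases like "score" not meaning "twenty" are easy to spot by eye.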


Re^2: Efficient selective substitution on list of words by BrowserUk (Patriarch) on Jan 31, 2010 at 17:15 UTC
The target language is Asian, where 1) there are no spaces between words;