Re: Efficient selective substitution on list of words

Here's an approach based on regexes that might be of interest. It has some problems (addressed in Update), and it might be better to pursue the design of a parser for this application. I don't think it should be much of a problem to compile Unicode regexes, although I must admit I have no experience in this area.

The problems I see (at the moment) are:

Stop word prefixes and suffixes are analyzed entirely independently, so that core in the (nonsense) word encoregent (last line of test text below) is not translated even though the prefix en and suffix gent are not associated with each other in the stop word list. Update: In fact, encore stops translation of core. (Actually, now I think of it, this aspect of the regex could probably be fixed fairly easily, but at the cost of even slower speed. Maybe update later... See Update.)
A word like cores (last line of test text) must be specified separately for translation, with its own stop word list. (Uncomment the specification for cores in the @trans array to see the effect.)
A certain amount of confusion is introduced by the fact that the test text uses spaces to delimit words, and spaces, if I correctly understand the OP, would be essentially absent in the actual text. However, I couldn't get my mind around a total absence of whitespace!
I wouldn't be surprised to find that this approach is as slow as molasses in January (hey...!) for translating hundreds of thousands or millions of words.
If there is a very large number of words (and stop words) to translate, the final regex ($translate below) will be monstrous.
Probably some more I (and others) will think of later...

Anyhoo, FWIW, here it is.

use warnings;
use strict;

use List::MoreUtils qw(uniq);


my @trans = (
    #  substitute... for...        except in...
    [ 'TWENTY',     'score',    qw(fourscore scored scores)       ],
    [ 'CENTER',     'core',     qw(encore encores coregent score) ],
  # [ 'CENTERS',    'cores',    qw(encores scores)                ],
    [ 'JOHN',       'Johann',   qw(Johannesburg)                  ],
    [ 'CENTER',     'centre',    ],
    [ 'TRAVELED',   'travelled', ],
    [ 'HAS NOT',    'hasn\'t',   ],
    );

my %trans = map @{ $_ }[1, 0], @trans;

my ($translate) =
    map qr{ $_ }xms,
    join ' | ',
    map word_regex(@{ $_ }[1 .. $#{$_}]),
    @trans
    ;

while (defined(my $line = <DATA>)) {
    print $line;
    $line =~ s{ ($translate) }{$trans{$1}}xmsg;
    print $line;
    print "\n";
    }


sub word_regex {

    my ($word,
        @stops,
        ) = @_;

    my $not_prefix =                # 5. conjunction
        join ') (?<! ',             # 4. neg. look-behind...
        uniq                        # 3. no dups...
        map m{ \A (.+) $word }xms,  # 2. extract prefixes...
        @stops                      # 1. for any stop words...
        ;
    $not_prefix = "(?<! $not_prefix)" if $not_prefix;  # final wrap

    my $not_suffix =
        join ' | ',
        uniq
        map m{ $word (.+) \z }xms,
        @stops
        ;
    $not_suffix = "(?! $not_suffix)" if $not_suffix;

    return qr{ $not_prefix $word $not_suffix }xms;

    }


__DATA__
the core of the coregents encores scored fourscore score of
scores when Johann travelled to Johannesburg
for a score of cores hasn't a centre encoregent
[download]

Output:

>perl selective_trans_1.pl
the core of the coregents encores scored fourscore score of
the CENTER of the coregents encores scored fourscore TWENTY of

scores when Johann travelled to Johannesburg
scores when JOHN TRAVELED to Johannesburg

for a score of cores hasn't a centre encoregent
for a TWENTY of cores HAS NOT a CENTER encoregent
[download]

Update: A more-better version (replaces previous, less-better version). It addresses item 1 in the problem list above, and sorting by size as Polyglot suggests (apparently per tye) addresses item 2. It demonstrates the use of regexes in specifying the stop word list. (And expanding the function definitions inline would also produce the one-liner Polyglot wants. Win-win!) Although not what Polyglot wants or needs, I think this is a neat approach. I enjoyed working this problem, so thanks to Polyglot.

use warnings;
use strict;


my @translate = (
    #  insert...   for...         except in...
    [ 'TWENTY',   'score',     qw(twoscore unscored? score[srd])    ],
    [ 'CENTER',   'core',      qw(encore[sd]? score[sd]? core[srd]) ],
    [ 'CENTERS',  'cores',     qw(encores scores)                   ],
    [ 'JOHN',     'Johann',    qw(Johannesburg)                     ],
    [ 'CENTER',   'centre',    ],
    [ 'TRAVELED', 'travelled', ],
    [ 'HAS NOT',  'hasn\'t',   ],
    );

my %replace = map @{ $_ }[1, 0], @translate;

my $search =
    join ' | ',
    map  word_regex(@{ $_ }[1 .. $#{$_}]),
    sort { $b->[1] cmp $a->[1] }  # longest words first
    @translate
    ;

while (defined(my $line = <DATA>)) {
    print $line;
    $line =~ s{ ($search) }{$replace{$1}}xmsg;
    print $line;
    print "\n";
    }


sub word_regex {

    my ($word,
        @stops,
        ) = @_;

    my $not_stopped =
        join ' ',
        map  not_stopped(@$_),
        map  [ m{ \A (.*) ($word) (.*) \z }xms ],
        @stops
        ;

    return "$not_stopped $word";

    }


sub not_stopped {

    my ($stop_prefix,  # always defined if word defined, maybe empty
        $word,         # word embedded in stop word
        $stop_suffix,  # always defined if word defined, maybe empty
        ) = @_;

    return '' unless defined $word and length $word;

    # convert word to placeholder (faster match?)
    $word = sprintf '.{%d}', length $word;

    # convert stop prefix, if any, to POSITIVE assertion.
    $stop_prefix = "(?<= $stop_prefix)" if length $stop_prefix;

    # NEGATIVE assert of stop prefix, word placeholder, stop suffix.
    return "(?! $stop_prefix $word $stop_suffix)";

    }


__DATA__
a score and twoscore of unscored scorer won't unscore scored scores
at the core of the encore that was scored and cored
for many scores of encores Johann travelled to Johannesburg
for a score of cores hasn't a stercorean centre
[download]

Output:

>perl selective_trans_1.pl
a score and twoscore of unscored scorer won't unscore scored scores
a TWENTY and twoscore of unscored scorer won't unscore scored scores

at the core of the encore that was scored and cored
at the CENTER of the encore that was scored and cored

for many scores of encores Johann travelled to Johannesburg
for many scores of encores JOHN TRAVELED to Johannesburg

for a score of cores hasn't a stercorean centre
for a TWENTY of CENTERS HAS NOT a sterCENTERan CENTER
[download]

Comment on Re: Efficient selective substitution on list of words Select or Download Code