Re: Tag protein names in sentences

Are you using perl v5.10.0 or later?

I suspect that you may be falling foul of a pathological case of the regex Trie optimisation. See 5.10.0 regex slowdown for some discussion.

You might find your performance improves if you include:

BEGIN{
  ${^RE_TRIE_MAXBUF} = 0;
}

at the top of your program.

Note: That's just one possibility that I draw from the referenced thread. Things may have moved on since then.

Another possibility would be to break your 7-million-alternation regex up into smaller ones. For example, you could group the names into subsets keyed by their first "word", and build an alternation regex from each group. You'd then iterate over the hash of regexes and only invoke a group's full alternation on the sentence if the sentence contains that group's first word.

Something like:

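@proteinNames = ...;

my %res;

# Group the names by their first whitespace-delimited "word"
m[^(\S+)] and push @{ $res{ $1 } }, $_ for @proteinNames;

# One alternation per group; reverse sort puts longer names ahead of their prefixes
$res{ $_ } = join '|', reverse sort map { quotemeta } @{ $res{ $_ } }
    for keys %res;

while( my $line = <> ) {
    for my $key ( keys %res ) {
        # Cheap test first: skip the group unless its first word appears in the line
        if( $line =~ /\Q$key\E/i ) {
            $line =~ s[($res{ $key })][\*\*$1\*\*]ig;
        }
    }
    # Print once per line, after every matching group has been applied
    print $line;
}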

Whether that would help or hinder is impossible to say without testing it on real data.
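
If you want to put rough numbers on it, a Benchmark sketch along these lines might be a starting point. Everything in it is made up for illustration: the names (kin0001 and friends), the sentences, and the sub names tag_single and tag_grouped are all my assumptions, so swap in a slice of your real data before reading anything into the result. The invented names are deliberately chosen so that no group's first word occurs inside another group's names; with real data that can happen, and then the grouped version may tag differently from the single regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical stand-in data; replace with your real names and sentences.
my @proteinNames = map {
    my $n = sprintf '%04d', $_;
    ( "kin$n", "kin$n alpha", "rec$n beta" )
} 1 .. 200;

my @sentences = map {
    my $n = sprintf '%04d', $_;
    "It binds kin$n alpha near rec$n beta.\n"
} 1 .. 100;

# Approach 1: a single alternation over every name.
# reverse sort puts a longer name ahead of any name that is its prefix.
my $all   = join '|', reverse sort map { quotemeta } @proteinNames;
my $allRe = qr/($all)/i;

sub tag_single {
    my( $line ) = @_;
    $line =~ s[$allRe][**$1**]g;
    return $line;
}

# Approach 2: one smaller alternation per first "word".
my %res;
m[^(\S+)] and push @{ $res{ $1 } }, $_ for @proteinNames;

for my $key ( keys %res ) {
    my $alt = join '|', reverse sort map { quotemeta } @{ $res{ $key } };
    $res{ $key } = qr/($alt)/i;
}

sub tag_grouped {
    my( $line ) = @_;
    for my $key ( keys %res ) {
        # Cheap substring test before running the group's regex
        next if index( lc $line, lc $key ) < 0;
        my $re = $res{ $key };
        $line =~ s[$re][**$1**]g;
    }
    return $line;
}

# Run each approach for at least 1 CPU second and print a comparison table.
cmpthese( -1, {
    single  => sub { tag_single( $_ )  for @sentences },
    grouped => sub { tag_grouped( $_ ) for @sentences },
} );
```

On toy data like this the outcome says nothing about 7 million names; the point is only to have a harness you can feed real input, and rerun with and without the ${^RE_TRIE_MAXBUF} setting.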

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"I'd rather go naked than blow up my ass"
