in reply to Tag protein names in sentences
Are you using perl v5.10.0 or later?
I suspect that you may be falling foul of a pathological case of the regex Trie optimisation. See 5.10.0 regex slowdown for some discussion.
You might find your performance improves if you include:
BEGIN{ ${^RE_TRIE_MAXBUF} = 0; }
at the top of your program.
Note: That's just one possibility that I draw from the referenced thread. Things may have moved on since then.
Another possibility would be to break your 7-million-alternation regex up into smaller ones. For example, you could group the names into subsets keyed by their first "word", and build an alternation regex from each group. You'd then iterate over the hash of regexes and only invoke the full sub-regex on a sentence if it contains the group's first word.
Something like:
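    my @proteinNames = ...;

    # Group the names by their first whitespace-delimited word.
    my %res;
    m[^(\S+)] and push @{ $res{ $1 } }, $_ for @proteinNames;

    # One alternation per group; the reverse sort puts longer names before their prefixes.
    $res{ $_ } = join '|', reverse sort map { quotemeta } @{ $res{ $_ } } for keys %res;

    while( my $line = <> ) {
        for my $key ( keys %res ) {
            # Only run the group's full regex if the line contains its first word.
            if( $line =~ /\Q$key\E/i ) {
                $line =~ s[($res{ $key })][\*\*$1\*\*]ig;
            }
        }
        # Print once, after every matching group has been applied.
        print $line;
    }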
Whether that would help or hinder is impossible to say without testing it on real data.
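If you want to measure it, here is a rough sketch of how the two approaches could be compared with the core Benchmark module. The names and sentences are made-up stand-ins, and tagSingle and tagGrouped are hypothetical helper names of mine. Timings on a four-name list say nothing about 7 million names, so substitute your real data before drawing any conclusions.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

# Hypothetical sample data; replace with the real name list and sentences.
my @proteinNames = ( 'protein kinase C', 'protein kinase', 'insulin receptor', 'insulin' );
my @sentences = (
    "Insulin binds the insulin receptor.\n",
    "Protein kinase C is distinct from protein kinase A.\n",
    "Nothing of interest here.\n",
);

# Approach 1: a single alternation containing every name (longer names first).
my $all   = join '|', reverse sort map { quotemeta } @proteinNames;
my $bigRe = qr/($all)/i;

# Approach 2: one alternation per first word, plus a cheap pre-check regex.
my %groups;
m[^(\S+)] and push @{ $groups{ $1 } }, $_ for @proteinNames;

my %res;
for my $key ( keys %groups ) {
    my $alt = join '|', reverse sort map { quotemeta } @{ $groups{ $key } };
    $res{ $key } = [ qr/\Q$key\E/i, qr/($alt)/i ];
}

# Tag a copy of the line using the single big regex.
sub tagSingle {
    my $line = shift;
    $line =~ s/$bigRe/**$1**/g;
    return $line;
}

# Tag a copy of the line, running a group's regex only if its first word is present.
sub tagGrouped {
    my $line = shift;
    for my $key ( sort keys %res ) {
        my( $first, $full ) = @{ $res{ $key } };
        next unless $line =~ $first;
        $line =~ s/$full/**$1**/g;
    }
    return $line;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese( -1, {
    single  => sub { tagSingle( $_ )  for @sentences },
    grouped => sub { tagGrouped( $_ ) for @sentences },
} );
```

Both subs work on a copy of the line, so the same sentences can be reused across iterations. It is also worth checking that the two variants produce identical output on your data before comparing their speed, since the grouped version applies its substitutions one group at a time.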