Are you using perl v5.10.0 or later?

I suspect that you may be falling foul of a pathological case of the regex Trie optimisation. See 5.10.0 regex slowdown for some discussion.

You might find your performance improves if you include:

BEGIN{ ${^RE_TRIE_MAXBUF} = 0; }

at the top of your program.

Note: That's just one possibility that I draw from the referenced thread. Things may have moved on since then.

Another possibility would be to break your 7-million-alternation regex up into smaller ones. For example, you could group the names into subsets keyed by their first "word", and build an alternation regex from each group. You'd then iterate over the hash of regexes and only invoke a group's full regex on the sentence if the sentence contains that group's first word.

Something like:

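@proteinNames = ...;
my %res;
# Group the names by their first whitespace-delimited word.
m[^(\S+)] and push @{ $res{ $1 } }, $_ for @proteinNames;
# One alternation per group; reverse sort puts longer names ahead of their prefixes.
$res{ $_ } = join '|', reverse sort map { quotemeta } @{ $res{ $_ } } for keys %res;
while( my $line = <> ) {
    for my $key ( keys %res ) {
        # Only run the group's full regex if the line contains its first word.
        if( $line =~ /\Q$key\E/i ) {
            $line =~ s[($res{ $key })][\*\*$1\*\*]ig;
        }
    }
    # Print once per line, after every matching group has been applied.
    print $line;
}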

Whether that would help or hinder is impossible to say without testing it on real data.
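One way to find out is to time both approaches with the core Benchmark module. The following is only a rough sketch under stated assumptions: the names and sentences are synthetic stand-ins (@names, @lines), and tag_single and tag_grouped are hypothetical helpers written for this comparison, not code from the original question. Swap in your real data before drawing any conclusions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic stand-in data (an assumption); replace with the real names and sentences.
my @names = map { my $p = sprintf 'prot%04d', $_; ( $p, "$p kinase", "$p kinase alpha" ) } 1 .. 500;
my @lines = map { sprintf "binding of prot%04d kinase alpha to prot%04d was observed\n", $_, $_ + 1 } 1 .. 20;

# Approach 1: a single alternation over every name, longest names first.
my $big    = join '|', map { quotemeta } sort { length $b <=> length $a } @names;
my $big_re = qr/($big)/i;

# Approach 2: one alternation per first word, plus a cheap pre-check regex.
my %group;
for my $name (@names) {
    my ($first) = $name =~ /^(\S+)/ or next;
    push @{ $group{ lc $first } }, $name;
}
my ( %first_re, %full_re );
for my $first ( keys %group ) {
    my $alt = join '|', map { quotemeta } sort { length $b <=> length $a } @{ $group{$first} };
    $first_re{$first} = qr/\Q$first\E/i;
    $full_re{$first}  = qr/($alt)/i;
}

# Tag a line using the single big regex.
sub tag_single {
    my ($line) = @_;
    $line =~ s/$big_re/**$1**/g;
    return $line;
}

# Tag a line by trying each group, skipping groups whose first word is absent.
sub tag_grouped {
    my ($line) = @_;
    for my $first ( keys %full_re ) {
        next unless $line =~ $first_re{$first};
        $line =~ s/$full_re{$first}/**$1**/g;
    }
    return $line;
}

# Run each approach for at least one CPU second and print a comparison table.
cmpthese( -1, {
    single  => sub { tag_single($_)  for @lines },
    grouped => sub { tag_grouped($_) for @lines },
} );
```

The numbers will depend heavily on how many groups there are and how often their first words actually occur in the text, so a synthetic run like this only tells you the harness works.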



In reply to Re: Tag protein names in sentences by BrowserUk
in thread Tag protein names in sentences by sinlam
