in reply to foreach (@array) s/x/y/ efficiency

First, an interesting point about the regex... Within a character class, \b matches a backspace, rather than a word-boundary. [\W\b] will match either a non-word character or a backspace character (which is a non-word character anyway).
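
To make that point concrete, here is a small sketch (the variable names are mine, not from the original post) that checks how \b behaves inside and outside a character class:

```perl
use strict;
use warnings;

# Inside a character class, \b is the backspace character (ASCII 0x08).
my $backspace_in_class = "\x08" =~ /^[\b]$/ ? 1 : 0;

# Outside a class, \b is a zero-width word boundary and consumes nothing.
my $boundary_outside = "a b" =~ /a\b b/ ? 1 : 0;

# Backspace is already a non-word character, so \W alone covers it.
my $backspace_is_nonword = "\x08" =~ /^\W$/ ? 1 : 0;

print "$backspace_in_class $boundary_outside $backspace_is_nonword\n";
```

All three checks should succeed, which is why [\W\b] behaves the same as plain \W.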

I would actually use lookbehind and lookahead to make the replacement simpler:

s,(?<!\w)($worda)(\W+)($wordb)(?!\w),<B>$1</B>$2<B>$3</B>,i

Next, I'm trying to figure out what makes

s|<B><B>|<B>|g; s|</B></B>|</B>|g;

necessary. Because your regex only allows non-word characters between word A and word B, and <B> and </B> each contain a word character, once bold tags are put around a word that word should never be matched by your regex again.

Ah... Unless your material may already contain some bold tags before you do any of the substitutions. Then you could end up with doubled tags to remove.
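
Here is a rough sketch of both cases, using made-up sample text and a hypothetical helper named bold_phrase (neither is from the original question). It shows that a second pass leaves an already-bolded phrase alone, and that text which was bolded beforehand does pick up doubled tags:

```perl
use strict;
use warnings;

# Hypothetical sample data; $worda and $wordb stand in for one key phrase.
my ($worda, $wordb) = ('not', 'immediately');

# Hypothetical helper wrapping the lookaround substitution shown above.
sub bold_phrase {
    my ($text) = @_;
    $text =~ s,(?<!\w)($worda)(\W+)($wordb)(?!\w),<B>$1</B>$2<B>$3</B>,i;
    return $text;
}

# First pass bolds both words; a second pass finds nothing to match,
# because "</B> <B>" contains the word character "B".
my $once  = bold_phrase('It will not immediately happen.');
my $twice = bold_phrase($once);

# Text that already carries bold tags ends up with doubled tags...
my $prebolded = bold_phrase('It will <B>not immediately</B> happen.');

# ...which the two cleanup substitutions collapse again.
(my $cleaned = $prebolded) =~ s|<B><B>|<B>|g;
$cleaned =~ s|</B></B>|</B>|g;

print "$once\n$prebolded\n$cleaned\n";
```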
 

Finally, here's how I would try to do this more efficiently. I would combine @material into a single string, perform the substitutions, and then split back to @material.

my $material = join "\0", @material;
foreach my $phrase (@key_phrases) {
    my ($worda, $wordb) = split / /, $phrase;
    $material =~ s{(?<!\w)($worda)([^\w\0]+)($wordb)(?!\w)}
                  {<B>$1</B>$2<B>$3</B>}i;
}
@material = split /\0/, $material;

As you can see, I'm using "\0" as a temporary divider between pieces of @material; I've changed \W+ to [^\w\0]+ in the regex so that a match can't span two pieces.
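
A quick way to see why the divider has to be excluded, using a made-up two-element @material where the phrase straddles the boundary:

```perl
use strict;
use warnings;

# Hypothetical sample: the phrase "not immediately" is split across elements.
my @material = ("the answer is not", "immediately obvious");
my $joined   = join "\0", @material;

# \W+ treats the "\0" divider as an ordinary non-word character,
# so the match leaks across the boundary.
my $loose_matches =
    $joined =~ /(?<!\w)(not)(\W+)(immediately)(?!\w)/i ? 1 : 0;

# [^\w\0]+ refuses to cross the divider, so nothing matches here.
my $strict_matches =
    $joined =~ /(?<!\w)(not)([^\w\0]+)(immediately)(?!\w)/i ? 1 : 0;

print "loose=$loose_matches strict=$strict_matches\n";
```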

I considered also building a single regex to match all the key phrases, but since each phrase appears only once I don't know if that would be more efficient.
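
For what it's worth, a single-regex version might look something like the sketch below. This is untested against the real data and makes a few assumptions of my own: the sample @material and @key_phrases are invented, the words are quoted with \Q...\E, and it relies on the branch-reset group (?|...), which needs Perl 5.10 or later. I make no claim that it is faster.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my @material    = ("It will not immediately happen.", "They went up quickly.");
my @key_phrases = ("not immediately", "went up");

# One alternative per phrase: (word A)(non-word gap)(word B).
my $alternatives = join '|', map {
    my ($worda, $wordb) = split / /, $_;
    "(\Q$worda\E)([^\\w\\0]+)(\Q$wordb\E)";
} @key_phrases;

# (?|...) is a branch-reset group: every alternative fills $1, $2, $3.
my $combined = qr/(?<!\w)(?|$alternatives)(?!\w)/i;

my $material = join "\0", @material;
$material =~ s{$combined}{<B>$1</B>$2<B>$3</B>}g;
my @result = split /\0/, $material;

print "$_\n" for @result;
```

One difference from the loop: this makes a single left-to-right pass, so if two key phrases share a word (say "not immediately" and "immediately obvious"), only the first one found at that spot gets bolded.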

Re: Re: foreach (@array) s/x/y/ efficiency
by gryphon (Abbot) on Jan 11, 2001 at 02:43 UTC

    This ends up actually taking longer to run than the original. I think that's because the s/// has to fly through the entire string of $material, where the loop version (original) stops (last out of loop) when it hits the match. (Not that I would really know for sure what I'm talking about.)

      Could you provide the data you used to compare the different solutions? I really was expecting mine to be faster, so I'd like to figure out what I did wrong. Thanks!

        It's the book of Luke from the NIV Bible. Each element of @material is set up as ^\d\t\d\tLuke\t$chapter\$verse\$verse_text\n. And @key_phrases is just each key phrase from the material, with a single space between the two words (e.g. $key_phrases[6153] = "not immediately";).