foreach (@array) s/x/y/ efficiency

gryphon has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all,

I have an array called @material that contains a text file. I've also got an array called @key_phrases where each element is a string with two words in it, seperated by a space. Each element represents a two-word phrase in the text file that exists only once, thereby making it "key."

What I want to do is find the most efficient way, without loosing the structure of @material, of finding and "marking" each key words pair with HTML bold tagging. What I've come up with works, but it's painfully slow.

my $phrase;
foreach $phrase (@key_phrases) {
     my($worda,$wordb) = split(/ /, $phrase);
     foreach (@material) {
          last if
          (s|([\W\b])($worda)(\W+)($wordb)([\W\b])|
               $1<B>$2</B>$3<B>$4</B>$5|i);
     }
}

# Here's where I clean up redundant tags...
foreach (@material) {
     s|<B><B>|<B>|g;
     s|</B></B>|</B>|g;
     s|</B> <B>| |g;
}
[download]

Any suggestions on how to get this to run faster? It annoys me that I have to loop through @material each time I need to find a @key_phrases element. By definition, the key phrase will only happen once in @material, so I could do some kind of grep as long as I can put the altered element back into @material in the right order.

Oh, one other thing, I'm running this on a Win32 box (not by choice), so I'm somewhat restricted in what UNIX command-line features I can use.

Thanks in advance. :)

-Gryphon.

Comment on foreach (@array) s/x/y/ efficiency Select or Download Code

Replies are listed 'Best First'.
Re: foreach (@array) s/x/y/ efficiency by chipmunk (Parson) on Jan 10, 2001 at 08:06 UTC
First, an interesting point about the regex... Within a character class, `\b` matches a backspace, rather than a word-boundary. `[\W\b]` will match either a non-word character or a backspace character (which is a non-word character anyway). I would actually use lookbehind and lookahead to make the replacement simpler: `s,(?<!\w)($worda)(\W+)($wordb)(?!\w),<B>$1</B>$2<B>$3</B>,i` Next, I'm trying to figure out what makes `s\|<B><B>\|<B>\|g; s\|</B></B>\|</B>\|g;` necessary. Because your regex only allows non-word characters between word A and word B, and `<B>` and `</B>` each contain a word character, once bold tags are put around a word that word should never be matched by your regex again. Ah... Unless your material may already contain some bold tags before you do any of the substitutions. Then you could end up with doubled tags to remove. Finally, here's how I would try to do this more efficiently. I would combine @material into a single string, perform the substitutions, and then split back to @material. `my $material = join "\0", @material; foreach $phrase (@key_phrases) { my($worda, $wordb) = split / /, $phrase; $material =~ s{(?<!\w)($worda)([^\w\0]+)($wordb)(?!\w)} {<B>$1</B>$2<B>$3</B>}i; } @material = split /\0/, $material;` [download] As you can see, I'm using `"\0"` as a temporary divider between pieces of @material; I've updated the regex to make sure matches don't overlap two pieces. I considered also building a single regex to match all the key phrases, but since each phrase appears only once I don't know if that would be more efficient.	[reply] [d/l] [select]
Re: Re: foreach (@array) s/x/y/ efficiency by gryphon (Abbot) on Jan 11, 2001 at 02:43 UTC
This ends up actually taking longer to run than the original. I think that's because the s/// has to fly through the entire string of `$material`, where the loop version (original) stops (`last` out of loop) when it hits the match. (Not that I would really know for sure what I'm talking about.)	[reply] [d/l] [select]
Re: Re: Re: foreach (@array) s/x/y/ efficiency by chipmunk (Parson) on Jan 11, 2001 at 03:36 UTC
Could you provide the data you used to compare the different solutions? I really was expecting mine to be faster, so I'd like to figure out what I did wrong. Thanks!	[reply]
Re: Re: Re: Re: foreach (@array) s/x/y/ efficiency by gryphon (Abbot) on Jan 11, 2001 at 04:15 UTC
Re: foreach (@array) s/x/y/ efficiency by merlyn (Sage) on Jan 10, 2001 at 06:12 UTC
Your first basic optimization is "compile each regex only once". So get thee over to perlre and learn about `qr`. Once you've tackled that, if it's still not fast enough, you can probably turn your algorithm into a state machine so that you don't generate redundant B elements in the first place. -- Randal L. Schwartz, Perl hacker	[reply]
Re: Re: foreach (@array) s/x/y/ efficiency by gryphon (Abbot) on Jan 11, 2001 at 03:28 UTC
To say that this made things extremely faster would be a complete understatement. After being blown away at first, I ran a pseudo-benchmark, and noted that it cut my run time (on that section of code only) by a factor of at least 14! Thank you! Just for the sake of my personal education, here is what I came up with based on your suggestion. Did I get it right? Or is there an even better way? `foreach $phrase (@key_phrases) { my($worda,$wordb) = split(/ /, $phrase); my $pattern = qr/(\W\|\b)($worda)(\W+)($wordb)(\W\|\b)/; foreach (@material) { last if (s,$pattern,$1<U>$2</U>$3<U>$4</U>$5,i); } }` [download] Thanks again for your help and wisdom, great one. -Gryphon.	[reply] [d/l]
Re: foreach (@array) s/x/y/ efficiency by repson (Chaplain) on Jan 10, 2001 at 08:51 UTC
It might be nice if you could do this, since index and substr should be faster than regexen: `$text = join("\n",@material); for my $phrase (@key_phrases) { my $pos = index $text, $phrase; if ($pos == -1) { warn "Phrase: $phrase not found\n"; } substr($text, $pos, 0) = '<B>'; substr($text, ($pos+length($phrase)), 0) = '</B>'; }` [download] However you quite likely need to ignore case, and judging by your code you have other conditions which this code won't do. Of course you might be able to fix that with use of lc and maybe some preprocessing, but if your doing that you might destroy any speed gains.	[reply] [d/l]
Re: Re: foreach (@array) s/x/y/ efficiency by gryphon (Abbot) on Jan 11, 2001 at 02:53 UTC
Although `index` runs faster than `s///`, `$phrase` is actually two words that may have "stuff" between them (in `@material`) other than just a space (as they would have in `$phrase`). So this doesn't match a lot of the phrases. :(	[reply] [d/l] [select]