gryphon has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all,

I have an array called @material that contains a text file. I've also got an array called @key_phrases where each element is a string with two words in it, seperated by a space. Each element represents a two-word phrase in the text file that exists only once, thereby making it "key."

What I want to do is find the most efficient way, without loosing the structure of @material, of finding and "marking" each key words pair with HTML bold tagging. What I've come up with works, but it's painfully slow.

my $phrase; foreach $phrase (@key_phrases) { my($worda,$wordb) = split(/ /, $phrase); foreach (@material) { last if (s|([\W\b])($worda)(\W+)($wordb)([\W\b])| $1<B>$2</B>$3<B>$4</B>$5|i); } } # Here's where I clean up redundant tags... foreach (@material) { s|<B><B>|<B>|g; s|</B></B>|</B>|g; s|</B> <B>| |g; }

Any suggestions on how to get this to run faster? It annoys me that I have to loop through @material each time I need to find a @key_phrases element. By definition, the key phrase will only happen once in @material, so I could do some kind of grep as long as I can put the altered element back into @material in the right order.

Oh, one other thing, I'm running this on a Win32 box (not by choice), so I'm somewhat restricted in what UNIX command-line features I can use.

Thanks in advance. :)

-Gryphon.

Replies are listed 'Best First'.
Re: foreach (@array) s/x/y/ efficiency
by chipmunk (Parson) on Jan 10, 2001 at 08:06 UTC
    First, an interesting point about the regex... Within a character class, \b matches a backspace, rather than a word-boundary. [\W\b] will match either a non-word character or a backspace character (which is a non-word character anyway).

    I would actually use lookbehind and lookahead to make the replacement simpler: s,(?<!\w)($worda)(\W+)($wordb)(?!\w),<B>$1</B>$2<B>$3</B>,i Next, I'm trying to figure out what makes s|<B><B>|<B>|g; s|</B></B>|</B>|g; necessary. Because your regex only allows non-word characters between word A and word B, and <B> and </B> each contain a word character, once bold tags are put around a word that word should never be matched by your regex again. Ah... Unless your material may already contain some bold tags before you do any of the substitutions. Then you could end up with doubled tags to remove.
     

    Finally, here's how I would try to do this more efficiently. I would combine @material into a single string, perform the substitutions, and then split back to @material.

    my $material = join "\0", @material; foreach $phrase (@key_phrases) { my($worda, $wordb) = split / /, $phrase; $material =~ s{(?<!\w)($worda)([^\w\0]+)($wordb)(?!\w)} {<B>$1</B>$2<B>$3</B>}i; } @material = split /\0/, $material;
    As you can see, I'm using "\0" as a temporary divider between pieces of @material; I've updated the regex to make sure matches don't overlap two pieces.

    I considered also building a single regex to match all the key phrases, but since each phrase appears only once I don't know if that would be more efficient.

      This ends up actually taking longer to run than the original. I think that's because the s/// has to fly through the entire string of $material, where the loop version (original) stops (last out of loop) when it hits the match. (Not that I would really know for sure what I'm talking about.)

        Could you provide the data you used to compare the different solutions? I really was expecting mine to be faster, so I'd like to figure out what I did wrong. Thanks!
Re: foreach (@array) s/x/y/ efficiency
by merlyn (Sage) on Jan 10, 2001 at 06:12 UTC
    Your first basic optimization is "compile each regex only once". So get thee over to perlre and learn about qr.

    Once you've tackled that, if it's still not fast enough, you can probably turn your algorithm into a state machine so that you don't generate redundant B elements in the first place.

    -- Randal L. Schwartz, Perl hacker

      To say that this made things extremely faster would be a complete understatement. After being blown away at first, I ran a pseudo-benchmark, and noted that it cut my run time (on that section of code only) by a factor of at least 14! Thank you!

      Just for the sake of my personal education, here is what I came up with based on your suggestion. Did I get it right? Or is there an even better way?

      foreach $phrase (@key_phrases) { my($worda,$wordb) = split(/ /, $phrase); my $pattern = qr/(\W|\b)($worda)(\W+)($wordb)(\W|\b)/; foreach (@material) { last if (s,$pattern,$1<U>$2</U>$3<U>$4</U>$5,i); } }

      Thanks again for your help and wisdom, great one.

      -Gryphon.

Re: foreach (@array) s/x/y/ efficiency
by repson (Chaplain) on Jan 10, 2001 at 08:51 UTC
    It might be nice if you could do this, since index and substr should be faster than regexen:
    $text = join("\n",@material); for my $phrase (@key_phrases) { my $pos = index $text, $phrase; if ($pos == -1) { warn "Phrase: $phrase not found\n"; } substr($text, $pos, 0) = '<B>'; substr($text, ($pos+length($phrase)), 0) = '</B>'; }
    However you quite likely need to ignore case, and judging by your code you have other conditions which this code won't do. Of course you might be able to fix that with use of lc and maybe some preprocessing, but if your doing that you might destroy any speed gains.

      Although index runs faster than s///, $phrase is actually two words that may have "stuff" between them (in @material) other than just a space (as they would have in $phrase). So this doesn't match a lot of the phrases. :(