Re: Optimizing regular expressions

As one of the resident regexiacs, I'm going to spend a lot of time examining your program.

Just so you know, if I had three strings, $word, $ig_first, and $ig_last, which held the characters for each of those three classes, this is how I would construct a regex to match words from a text stream:

#!/usr/bin/perl -wl

### this code assumes that there are no characters
### in the "ignore_last" class that AREN'T in the
### "word" class -- it might seem silly that there
### would be, but still, that's how I'm coding this  
  
use strict;
  
my $text_stream = q{foo#&#bar};

my $ig_first = '#'; 
my $ig_last = '';
my $word = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#';
    
    
my $pre    = length($ig_first) ? qr/[\Q$ig_first\E]*/ : '';
my $post   = length($ig_last)  ? qr/[\Q$ig_last\E]*/  : '';
my $inside = length($ig_last)  ? qr/[\Q$ig_last\E]+/  : '';
my ($match, @words);
      
         
{
  # remove the ignore-last chars from $word, leaving the "regular" chars
  my $reg = $word;
  $reg =~ s/$inside//g if length $ig_last;
  $reg = qr/[\Q$reg\E]/;
 

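  # unroll the loop:
  # (alternation takes the first branch that matches, not the longest,
  # so the branch that can span post chars must not come after a bare
  # run of regular chars)
  $match = qr{
    ($pre)        # pre chars (capture group 1)

    (             # (capture group 2)
      $reg+       # one or more regular chars
      (?:
        $inside   # one or more post chars
        $reg+     # one or more non-post chars
      )*          # this chunk zero or more times

      |           # OR (the word starts with post chars)

      (?:
        $inside   # one or more post chars
        $reg+     # one or more non-post chars
      )+          # this chunk one or more times
    )
  }x;             # /x for extended mode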
}

    
$text_stream =~ s[$match]{ push @words, $2; "$1<b>$2</b>" }eg;

print $text_stream;
print "words: @words";
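To see the unrolled loop handle the host-name case, here is a cut-down sketch. The sample settings and text are made up for illustration, and it leaves out the leading ignore-first handling and the branch that lets a word start with an ignore-last character, so treat it as a simplification rather than the full regex above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample settings (not from the original program): '.' and '-'
# may appear inside a word but are ignored at its end.
my $word    = join '', 'A' .. 'Z', 'a' .. 'z', '.', '-';
my $ig_last = '.-';

# "Regular" chars are the word chars minus the ignore-last chars.
(my $regular = $word) =~ s/[\Q$ig_last\E]//g;
my $reg    = qr/[\Q$regular\E]/;
my $inside = qr/[\Q$ig_last\E]+/;

# Unrolled loop: runs of regular chars joined by runs of ignore-last
# chars, always ending on a regular char.
my $match = qr/$reg+(?:$inside$reg+)*/;

my $text  = 'Mail goes to mail.example.com. Try again later...';
my @words = $text =~ /($match)/g;

print join('|', @words), "\n";
# prints: Mail|goes|to|mail.example.com|Try|again|later
```

The dots inside the host name stay in the word, while the sentence-ending dot and the trailing ellipsis are left out because nothing "regular" follows them.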

japhy -- Perl and Regex Hacker

Re: Re: Optimizing regular expressions by moseley (Acolyte) on Jun 02, 2001 at 02:14 UTC
I appreciate the help, and I'm sorry for such a general question.

One thing to make clear is that the ignore_first and ignore_last characters are subsets of wordchars. The point of those settings is to allow characters within a word, but not at the start or end. The classic example is that a dot is OK within a host name, but not at the end (e.g. at the end of a sentence).

The other thing that complicates this a bit is that I'm not showing all of the original text with the words highlighted in the final output, but rather just a few words on either side of the highlighted text, the way a Google search does. Thus, I also need to be careful not to print words twice when highlighted words are close together. That is the reason I split the source text into an array of "swish" words and non-swish words, so I could easily mark words on either side of the matched word:

`@words = split /([^$wordchars])/, $source_text;`

I use two arrays to track this now (instead of an array of arrays), as it was faster to avoid all the dereferencing.

Here's a specific question: the "problem" with the above code is that I still need to remove the leading and trailing ignore characters. So that's an extra pattern match for every word: one for the split, and then another to extract the word from its ignore chars. I tried to find an expression to use in the above split that would do this in one shot, but it was looking like a complicated expression that might be slower than doing two matches, and I never found a pattern that I could test.

I also wonder if using a repeating pattern with /g might be faster than my word-by-word checking. But then I'm back to the problem of how to print the words around the match.

Thanks again for your help.
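For readers following along, the two-step approach described above (split with a capturing group, then strip the ignore characters from each word) might look roughly like this. The variable values and sample text are hypothetical, the separator uses `+` rather than the single-character class in the snippet above so that parts strictly alternate between word and non-word, and the trim is still a second match per word, not the one-shot expression being asked about:

```perl
use strict;
use warnings;

# Hypothetical settings; each string is written to be dropped straight
# into a character class.
my $wordchars    = 'a-zA-Z0-9.\-';
my $ignore_first = '.\-';
my $ignore_last  = '.\-';

my $source_text = 'See www.example.com. Or -not-.';

# The capturing group makes split return the separators too, so even
# slots hold "swish" words and odd slots hold the text between them.
my @parts = split /([^$wordchars]+)/, $source_text;

# One match per word peels off leading and trailing ignore chars.
my $trim = qr/\A([$ignore_first]*)(.*?)([$ignore_last]*)\z/s;

my @cores;
for my $i (0 .. $#parts) {
    next if $i % 2;                          # skip separators
    my ($lead, $core, $trail) = $parts[$i] =~ $trim;
    push @cores, $core if length $core;      # drop words that were all ignore chars
}

print join('|', @cores), "\n";
# prints: See|www.example.com|Or|not
```

Because `$lead` and `$trail` are captured rather than thrown away, the original text can still be rebuilt exactly when printing the context around a match.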
Re: Re: Re: Optimizing regular expressions by japhy (Canon) on Jun 02, 2001 at 18:41 UTC
I wrote up a module you might find useful. Let me know if it, or the test program, needs some documentation. It appears to be quite nifty.

Swish.pm

japhy -- Perl and Regex Hacker
Re: Optimizing regular expressions by moseley (Acolyte) on Jun 02, 2001 at 23:09 UTC
Interesting. It needed a few tweaks to compile (maybe I grabbed an early version). It also drops the non-wordchars in the output, and I couldn't get the context to work when matches were closer together than the BEFORE and AFTER settings. I'll spend some time with your code later, as it's an interesting approach. I like your coding style, too!

Anyway, you can see that this is not a trivial problem to solve... I'll keep checking back.

Thanks again. BTW, did you try running the code I posted?
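One possible way to handle matches that sit closer together than the BEFORE and AFTER settings is to compute a window per hit and merge windows that overlap or touch, so no word is printed twice. This is only a sketch: `context_ranges`, `%is_hit`, and the sample data are invented here and are not taken from Swish.pm or the posted program.

```perl
use strict;
use warnings;

# Hypothetical context sizes, counted in words.
use constant { BEFORE => 2, AFTER => 2 };

# Given the last valid index and a list of hit indexes, return merged
# [start, end] windows in ascending order.
sub context_ranges {
    my ($last_index, @hits) = @_;
    my @ranges;
    for my $hit (sort { $a <=> $b } @hits) {
        my $start = $hit - BEFORE;
        $start = 0 if $start < 0;
        my $end = $hit + AFTER;
        $end = $last_index if $end > $last_index;

        if (@ranges && $start <= $ranges[-1][1] + 1) {
            # Overlaps or touches the previous window: extend it.
            $ranges[-1][1] = $end if $end > $ranges[-1][1];
        }
        else {
            push @ranges, [$start, $end];
        }
    }
    return @ranges;
}

my @words  = qw(the quick brown fox jumps over the lazy dog and runs far away);
my %is_hit = map { $_ => 1 } (3, 5, 11);
my @ranges = context_ranges($#words, keys %is_hit);

for my $r (@ranges) {
    my @out = map { $is_hit{$_} ? "<b>$words[$_]</b>" : $words[$_] }
              $r->[0] .. $r->[1];
    print join(' ', @out), "\n";
}
# prints:
# quick brown <b>fox</b> jumps <b>over</b> the lazy
# and runs <b>far</b> away
```

The hits at 3 and 5 share one window because their contexts overlap, while the hit at 11 gets its own.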
Swish module (was Re: Re: Optimizing regular expressions) by japhy (Canon) on Jun 03, 2001 at 04:07 UTC
Re: Swish module (was Re: Re: Optimizing regular expressions) by moseley (Acolyte) on Jun 03, 2001 at 08:05 UTC