fletcher_the_dog has asked for the wisdom of the Perl Monks concerning the following question:

I have a project where I need to extract names with a particular last name from a large set of text files. I am using a regex like this to extract the names:
foreach my $name ($text=~/(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(?:Jones|Rogers +|Edwards|Smith|Jackson))/sg){ print $name."\n"; }
The last names have been changed to protect the innocent :-) In my real code there are about 15 last names I am looking for. This regex works fine, My question is: Is there a more efficient way to check for one of the last names than to enumerate them like I have done? Obviously, I will have to enumerate them somewhere, but it seems like my regex is rather inefficient as it has to walk through a long list. I have tried to think of a way to incorporate some kind of hash look up into the regex, but I couldn't figure out a way to do it.

Replies are listed 'Best First'.
Re: Efficiency in regex
by Paladin (Vicar) on Dec 28, 2002 at 03:00 UTC
    If you wanted to use a hash lookup, you could always pull the last name out of the string with a regex, then use that to look it up in the hash. Something like:
    # untested my @last_names{qw/Jones Rogers/} = (); foreach my $name ($text=~/(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(\w+))/sg) { next unless exists $last_names{$2}; print $name }
      Or make an array of qr'ed regexes that you can loop on on each line if the data files are large and you don't want to slurp. However, I think if you benchmark it one qred or list like you have orig will still be faster.
      Edited to add: One more note if you know only one match can happen per line the array of qr'ed regex may be faster if you last on the first match. But I suspect you will not be able to guarentee that case. Also how are you planning on dealing with mixed use names? for example, Stuart. Consider:
      blah la la blah the end, Stuart Bishop is 10 feet tall. and la blah bla la, Ross Stuart is kinda short.

      Unless you do some complicated magic you are going to potentally get false matches ("end," in this case for the first name).

      -Waswas

        No. Paladin's solution has 2 operations for each piece of data: 1 match, 2 lookup. The original solution has anywhere from 1 to n (in this case, n=15) operations for each piece of data, depending on how soon the item matches. Paladin's is a constant O(2), while the original will average around O(n/2). A test:

        use Benchmark; my %names; my (@list) = qw(Jones Rogers Edwards Smith Jackson Ryan Jones tilly dws paladin footpad jeffa Elian ybiC TheDamian ); @names{@list} = (1) x @list; my $names = join '|', @list; my $data = do {local $/; <DATA>}; timethese ( 100_000, { "paladin" => sub { my $text = $data; foreach my $name ($text=~/(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(\w ++))/go){ "$name\n" if exists $names{$name} } }, "original" => sub { my $text = $data; foreach my $name ($text=~/(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(?: +$names))/sgo){ "$name\n" } } }); __DATA__ Dr. Happy Sr. Rogers Senoir. Chacho Senoira. Chachese Mr. Ryan Mrs. Smith (I'm sorry) Ms. Jackson (oooh, I am for reaaal) Dr. Tilly Mr. Elian Asdokfj. adfsdf Ms. asdfasdf Mr. Burns Qsdokfj. adfsdf q. TheDamian Hello. There This. Should Not. Fail

        And the results:

        Benchmark: timing 100000 iterations of optimized, original, paladin... original: 25 wallclock secs (23.14 usr + 0.00 sys = 23.14 CPU) @ 43 +20.77/s (n=100000) paladin: 18 wallclock secs (16.93 usr + 0.00 sys = 16.93 CPU) @ 59 +05.63/s (n=100000)

        In response to your update, I think you are mistaken; "end," doesn't match anywhere at all.

      I like the idea behind this solution, but this won't work in an instance like this: "Tom Jones is here". $2 will be the word "is" which will not be in the hash.
Re: Efficiency in regex
by Zaxo (Archbishop) on Dec 28, 2002 at 03:51 UTC

    A few pretty much orthogonal bits.

    • Order the list of surnames by frequency, commonest first.
    • Think about index. The surnames are fixed strings.
    • @foo = substr( $_, $tart, $offset) =~ /$re/; has a lot to recommend it.
    • The m//sg form suggests you are slurping the file. Unless the records are messy, it's often better to process each record as it's read.
    • Benchmark. Regex and search code performance is subject to subtlety, and small things can make a big difference.

    After Compline,
    Zaxo

Re: Efficiency in regex
by bart (Canon) on Dec 28, 2002 at 16:52 UTC
    Is there a more efficient way to check for one of the last names than to enumerate them like I have done?
    Sure. There's a module available which will look for the common prefixes and suffixes in the alternatives, and that way, create a faster regex: Regex::PreSuf
Re: Efficiency in regex
by gmpassos (Priest) on Dec 28, 2002 at 18:09 UTC
    You can build the regex on runtime without eval():
    my @names = qw(Jones Rogers Edwards Smith Jackson) ; my $names_re = join("|", @names) ; foreach my $name ($text=~/(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(?:$names_re))/ +sg){ print $name."\n"; }

    Graciliano M. P.
    "The creativity is the expression of the liberty".