in reply to Efficiency in regex

If you wanted to use a hash lookup, you could always pull the last name out of the string with a regex, then use that to look it up in the hash. Something like:
# untested
my %last_names;
@last_names{qw/Jones Rogers/} = ();
# match in scalar context so $1 and $2 belong to the current match
while ($text =~ /(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(\w+))/sg) {
    next unless exists $last_names{$2};
    print "$1\n";
}

Re: Re: Efficiency in regex
by waswas-fng (Curate) on Dec 28, 2002 at 03:16 UTC
    Or make an array of qr'ed regexes that you can loop over for each line, if the data files are large and you don't want to slurp them. However, I think that if you benchmark it, a single qr'ed regex or a list like the one you have originally will still be faster.
    Edited to add: One more note: if you know that only one match can happen per line, the array of qr'ed regexes may be faster if you "last" out of the loop on the first match. But I suspect you will not be able to guarantee that. Also, how are you planning to deal with mixed-use names, for example Stuart, which can be a first name or a last name? Consider:
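A minimal sketch of the array-of-qr idea, reading line by line instead of slurping. The two-name list, the sub name matching_lines, and the in-memory sample are assumptions for illustration only; stopping at the first hit is only appropriate when you just need to know that a line contains some name.

```perl
use strict;
use warnings;

# Hypothetical short name list; the real list would be longer.
my @last_names = qw(Jones Rogers);

# Compile each pattern once, up front.
my @patterns = map { qr/\b\Q$_\E\b/ } @last_names;

# Return the lines from the filehandle that mention at least one name.
sub matching_lines {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        for my $re (@patterns) {
            if ($line =~ $re) {
                push @hits, $line;
                last;    # one hit is enough to keep the line
            }
        }
    }
    return @hits;
}

# Usage: an in-memory filehandle stands in for a large data file.
my $sample = "Mr. Jones was here\nnobody relevant\nDr. Rogers left\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my @found = matching_lines($fh);
close $fh;
print @found;
```

Whether this beats a single alternation is something only a benchmark on the real data can answer.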
    blah la la blah the end, Stuart Bishop is 10 feet tall. and la blah bla la, Ross Stuart is kinda short.

    Unless you do some complicated magic, you are going to potentially get false matches ("end," in this case for the first name).
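To make the mixed-use problem concrete, here is a small sketch run against the sentence above. Both sub names and the "capitalized word directly before the name" rule are assumptions made up for this illustration, not anything proposed in the thread; the stricter rule would still miss a last name that starts a sentence.

```perl
use strict;
use warnings;

my $text = 'blah la la blah the end, Stuart Bishop is 10 feet tall. '
         . 'and la blah bla la, Ross Stuart is kinda short.';

# Naive check: count every appearance of the name as a whole word.
sub naive_hits {
    my ($name, $str) = @_;
    my @hits = $str =~ /\b(\Q$name\E)\b/g;
    return scalar @hits;
}

# Stricter check (illustrative assumption): the name only counts as a
# last name when a capitalized word comes directly before it.
sub last_name_hits {
    my ($name, $str) = @_;
    my @hits = $str =~ /\b[A-Z][a-z]+\s+(\Q$name\E)\b/g;
    return scalar @hits;
}

print "naive: ",        naive_hits('Stuart', $text),     "\n";  # both uses counted
print "as last name: ", last_name_hits('Stuart', $text), "\n";  # only "Ross Stuart"
```

The naive count treats "Stuart Bishop" and "Ross Stuart" alike, which is exactly the ambiguity in question.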

    -Waswas

      No. Paladin's solution has two operations for each piece of data: (1) a match and (2) a hash lookup. The original solution has anywhere from 1 to n (in this case, n=15) operations for each piece of data, depending on how soon the item matches. Paladin's cost is constant, O(1), while the original will average around n/2 operations, which is O(n). A test:

      use Benchmark;

      my %names;
      my (@list) = qw(Jones Rogers Edwards Smith Jackson Ryan Jones tilly dws
                      paladin footpad jeffa Elian ybiC TheDamian);
      @names{@list} = (1) x @list;
      my $names = join '|', @list;
      my $data = do { local $/; <DATA> };

      timethese(100_000, {
          "paladin" => sub {
              my $text = $data;
              foreach my $name ($text =~ /(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(\w+))/go) {
                  "$name\n" if exists $names{$name};
              }
          },
          "original" => sub {
              my $text = $data;
              foreach my $name ($text =~ /(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(?:$names))/sgo) {
                  "$name\n";
              }
          },
      });
      __DATA__
      Dr. Happy
      Sr. Rogers
      Senoir. Chacho
      Senoira. Chachese
      Mr. Ryan
      Mrs. Smith
      (I'm sorry) Ms. Jackson (oooh, I am for reaaal)
      Dr. Tilly
      Mr. Elian
      Asdokfj. adfsdf
      Ms. asdfasdf
      Mr. Burns
      Qsdokfj. adfsdf
      q. TheDamian
      Hello. There
      This. Should
      Not. Fail

      And the results:

      Benchmark: timing 100000 iterations of optimized, original, paladin...
        original: 25 wallclock secs (23.14 usr +  0.00 sys = 23.14 CPU) @ 4320.77/s (n=100000)
         paladin: 18 wallclock secs (16.93 usr +  0.00 sys = 16.93 CPU) @ 5905.63/s (n=100000)

      In response to your update, I think you are mistaken; "end," doesn't match anywhere at all.

Re: Re: Efficiency in regex
by fletcher_the_dog (Friar) on Dec 30, 2002 at 15:25 UTC
    I like the idea behind this solution, but it won't work in a case like this: "Tom Jones is here". $2 will be the word "is", which will not be in the hash.
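One possible way around this, as a hedged sketch rather than a tested fix: require the final captured word to be capitalized too, so the regex backtracks from "is" to "Jones". The sub name find_names and the two-name hash are assumptions for illustration. Note that this assumes last names are capitalized, so lowercase entries such as "tilly" or "dws" would never be found.

```perl
use strict;
use warnings;

my %last_names;
@last_names{qw/Jones Rogers/} = ();

# Return each full name in $text whose final capitalized word is a
# known last name.
sub find_names {
    my ($text) = @_;
    my @found;
    # $2 must itself start with a capital, so a trailing lowercase word
    # such as "is" can no longer be captured as the last name.
    while ($text =~ /\b((?:[A-Z](?:\.|[a-z]+)\s+)+([A-Z][a-z]+))\b/g) {
        push @found, $1 if exists $last_names{$2};
    }
    return @found;
}

print "$_\n" for find_names('Tom Jones is here');    # prints "Tom Jones"
```

A name followed directly by another capitalized word (for example at a sentence boundary) would still be captured too greedily, so this narrows the problem without removing it.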