Efficiency in regex

fletcher_the_dog has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Efficiency in regex by Paladin (Vicar) on Dec 28, 2002 at 03:00 UTC
If you wanted to use a hash lookup, you could always pull the last name out of the string with a regex, then use that to look it up in the hash. Something like: `# untested my @last_names{qw/Jones Rogers/} = (); foreach my $name ($text=~/(\b(?:[A-Z](?:\.\|[a-z]+)\s+)+(\w+))/sg) { next unless exists $last_names{$2}; print $name }` [download]	[reply] [d/l]
Re: Re: Efficiency in regex by waswas-fng (Curate) on Dec 28, 2002 at 03:16 UTC
Or make an array of qr'ed regexes that you can loop on on each line if the data files are large and you don't want to slurp. However, I think if you benchmark it one qred or list like you have orig will still be faster. Edited to add: One more note if you know only one match can happen per line the array of qr'ed regex may be faster if you last on the first match. But I suspect you will not be able to guarentee that case. Also how are you planning on dealing with mixed use names? for example, Stuart. Consider: `blah la la blah the end, Stuart Bishop is 10 feet tall. and la blah bla la, Ross Stuart is kinda short.` [download] Unless you do some complicated magic you are going to potentally get false matches ("end," in this case for the first name). -Waswas	[reply] [d/l]
Re: Re: Re: Efficiency in regex by jryan (Vicar) on Dec 28, 2002 at 04:10 UTC
No. Paladin's solution has 2 operations for each piece of data: 1 match, 2 lookup. The original solution has anywhere from 1 to n (in this case, n=15) operations for each piece of data, depending on how soon the item matches. Paladin's is a constant O(2), while the original will average around O(n/2). A test: use Benchmark; my %names; my (@list) = qw(Jones Rogers Edwards Smith Jackson Ryan Jones tilly dws paladin footpad jeffa Elian ybiC TheDamian ); @names{@list} = (1) x @list; my $names = join '\|', @list; my $data = do {local $/; <DATA>}; timethese ( 100_000, { "paladin" => sub { my $text = $data; foreach my $name ($text=~/(\b(?:[A-Z](?:\.\|[a-z]+)\s+)+(\w ++))/go){ "$name\n" if exists $names{$name} } }, "original" => sub { my $text = $data; foreach my $name ($text=~/(\b(?:[A-Z](?:\.\|[a-z]+)\s+)+(?: +$names))/sgo){ "$name\n" } } }); __DATA__ Dr. Happy Sr. Rogers Senoir. Chacho Senoira. Chachese Mr. Ryan Mrs. Smith (I'm sorry) Ms. Jackson (oooh, I am for reaaal) Dr. Tilly Mr. Elian Asdokfj. adfsdf Ms. asdfasdf Mr. Burns Qsdokfj. adfsdf q. TheDamian Hello. There This. Should Not. Fail [download] And the results: `Benchmark: timing 100000 iterations of optimized, original, paladin... original: 25 wallclock secs (23.14 usr + 0.00 sys = 23.14 CPU) @ 43 +20.77/s (n=100000) paladin: 18 wallclock secs (16.93 usr + 0.00 sys = 16.93 CPU) @ 59 +05.63/s (n=100000)` [download] In response to your update, I think you are mistaken; "end," doesn't match anywhere at all.	[reply] [d/l] [select]
Re: Re: Efficiency in regex by fletcher_the_dog (Friar) on Dec 30, 2002 at 15:25 UTC
I like the idea behind this solution, but this won't work in an instance like this: "Tom Jones is here". $2 will be the word "is" which will not be in the hash.	[reply]
Re: Efficiency in regex by Zaxo (Archbishop) on Dec 28, 2002 at 03:51 UTC
A few pretty much orthogonal bits. Order the list of surnames by frequency, commonest first. Think about index. The surnames are fixed strings. `@foo = substr( $_, $tart, $offset) =~ /$re/;` has a lot to recommend it. The m//sg form suggests you are slurping the file. Unless the records are messy, it's often better to process each record as it's read. Benchmark. Regex and search code performance is subject to subtlety, and small things can make a big difference. After Compline, Zaxo	[reply] [d/l]
Re: Efficiency in regex by bart (Canon) on Dec 28, 2002 at 16:52 UTC
Is there a more efficient way to check for one of the last names than to enumerate them like I have done? Sure. There's a module available which will look for the common prefixes and suffixes in the alternatives, and that way, create a faster regex: Regex::PreSuf	[reply]
Re: Efficiency in regex by gmpassos (Priest) on Dec 28, 2002 at 18:09 UTC
You can build the regex on runtime without eval(): `my @names = qw(Jones Rogers Edwards Smith Jackson) ; my $names_re = join("\|", @names) ; foreach my $name ($text=~/(\b(?:[A-Z](?:\.\|[a-z]+)\s+)+(?:$names_re))/ +sg){ print $name."\n"; }` [download] Graciliano M. P. "The creativity is the expression of the liberty".	[reply] [d/l]