Re: Efficiency in regex

Re: Re: Efficiency in regex by waswas-fng (Curate) on Dec 28, 2002 at 03:16 UTC
Or make an array of qr'ed regexes that you can loop over for each line, if the data files are large and you don't want to slurp them. However, I think that if you benchmark it, a single qr'ed regex or a list like the one you had originally will still be faster.

Edited to add: One more note: if you know that only one match can happen per line, the array of qr'ed regexes may be faster if you `last` on the first match. But I suspect you will not be able to guarantee that case.

Also, how are you planning on dealing with mixed-use names, for example Stuart? Consider:

`blah la la blah the end, Stuart Bishop is 10 feet tall. and la blah bla la, Ross Stuart is kinda short.`

Unless you do some complicated magic, you are going to potentially get false matches ("end," in this case for the first name).

-Waswas
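A minimal sketch of the array-of-qr'ed-regexes idea, in case it helps to see it written out. The name list, the sample lines, and the helper `first_name_on_line` are made up for illustration, and returning from the helper stands in for calling `last` inside the loop. Note that "first match" here means first in pattern order, not first by position in the line.

```perl
use strict;
use warnings;

# Hypothetical name list, for illustration only.
my @names = qw(Jones Rogers Smith);

# Precompile one pattern per name so nothing is recompiled inside the loop.
my @patterns = map { qr/\b\Q$_\E\b/ } @names;

# Return the first name (in pattern order) found on a line, or undef.
sub first_name_on_line {
    my ($line) = @_;
    for my $i (0 .. $#patterns) {
        # Returning here plays the role of `last`: stop at the first hit.
        return $names[$i] if $line =~ $patterns[$i];
    }
    return;
}

# In real use these lines would come from while (<$fh>), one at a time,
# so the file never has to be slurped.
my @sample = ("Mr. Smith went home", "nothing to see here");
for my $line (@sample) {
    my $hit = first_name_on_line($line);
    print defined $hit ? "matched $hit\n" : "no match\n";
}
```

Whether this beats a single alternation is something only a benchmark on the real data can settle.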
Re: Re: Re: Efficiency in regex by jryan (Vicar) on Dec 28, 2002 at 04:10 UTC
No. Paladin's solution has 2 operations for each piece of data: (1) a match and (2) a lookup. The original solution has anywhere from 1 to n (in this case, n=15) operations for each piece of data, depending on how soon the item matches. Paladin's is a constant 2 operations (O(1)), while the original will average around n/2 (O(n)). A test:

```perl
use Benchmark;

my %names;
my (@list) = qw(Jones Rogers Edwards Smith Jackson Ryan Jones tilly dws
                paladin footpad jeffa Elian ybiC TheDamian );
@names{@list} = (1) x @list;        # hash of names for the lookup version
my $names = join '|', @list;        # alternation for the original version
my $data = do {local $/; <DATA>};   # slurp the test data

timethese ( 100_000, {
    "paladin" => sub {
        my $text = $data;
        # generic match, then a hash lookup on each captured string
        foreach my $name ($text=~/(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(\w+))/go){
            "$name\n" if exists $names{$name}   # result is discarded, nothing is printed
        }
    },
    "original" => sub {
        my $text = $data;
        # every name is tried inside the regex itself
        foreach my $name ($text=~/(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(?:$names))/sgo){
            "$name\n"
        }
    }
});

__DATA__
Dr. Happy
Sr. Rogers
Senoir. Chacho
Senoira. Chachese
Mr. Ryan
Mrs. Smith
(I'm sorry) Ms. Jackson (oooh, I am for reaaal)
Dr. Tilly
Mr. Elian
Asdokfj. adfsdf
Ms. asdfasdf
Mr. Burns
Qsdokfj. adfsdf
q. TheDamian
Hello. There
This. Should
Not. Fail
```

And the results:

```
Benchmark: timing 100000 iterations of optimized, original, paladin...
  original: 25 wallclock secs (23.14 usr +  0.00 sys = 23.14 CPU) @ 4320.77/s (n=100000)
   paladin: 18 wallclock secs (16.93 usr +  0.00 sys = 16.93 CPU) @ 5905.63/s (n=100000)
```

In response to your update, I think you are mistaken; "end," doesn't match anywhere at all.
Re: Re: Efficiency in regex by fletcher_the_dog (Friar) on Dec 30, 2002 at 15:25 UTC
I like the idea behind this solution, but it won't work in a case like "Tom Jones is here": $2 will be the word "is", which will not be in the hash.
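The case above can be reproduced with a short script. This is only a sketch: the pattern is the one from the benchmark earlier in the thread, while the three-name hash and the helpers `captures` and `known_names` are made up for illustration. `known_names` is one possible workaround (checking every word of $1 instead of only $2), not something anyone in the thread proposed.

```perl
use strict;
use warnings;

# Pattern from the benchmark: one or more capitalized words (or initials),
# each followed by whitespace, then a final word captured as $2.
my $re = qr/(\b(?:[A-Z](?:\.|[a-z]+)\s+)+(\w+))/;

# Illustrative subset of names.
my %names = map { $_ => 1 } qw(Jones Ryan Smith);

# Return ($1, $2) for the first match, or an empty list.
sub captures {
    my ($text) = @_;
    return $text =~ $re ? ($1, $2) : ();
}

# Hypothetical workaround: look up every word of $1, not just $2.
sub known_names {
    my ($text) = @_;
    my ($whole) = captures($text);
    return () unless defined $whole;
    return grep { exists $names{$_} } split /\s+/, $whole;
}

for my $text ("Tom Jones", "Tom Jones is here") {
    my (undef, $last) = captures($text);
    my @found = known_names($text);
    my $in_hash = exists $names{$last} ? "yes" : "no";
    print "${text}: \$2 is '$last', in hash: $in_hash; names in \$1: @found\n";
}
```

For "Tom Jones" the last captured word is "Jones" and the lookup succeeds; for "Tom Jones is here" the greedy group consumes both "Tom" and "Jones", so $2 ends up as "is" and the plain lookup misses.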