Slow Regex - How to Optimize

noslenj123 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Slow Regex - How to Optimize by dave_the_m (Monsignor) on Aug 30, 2005 at 21:50 UTC
You code causes the regex to be recompiled each time round the inner loop. A slight rearrangement should improve things: `foreach my $sub ( keys %SUBS ) { my $re = qr/[^a-zA-Z]$sub[^a-zA-Z]*\(/; foreach my $line ( @sub_code ) { if ( $line =~ $re ) { push( @subs, $key ) } } }` [download] Dave.	[reply] [d/l]
Re^2: Slow Regex - How to Optimize by ikegami (Patriarch) on Aug 30, 2005 at 22:05 UTC
That's not true. `m/.../` only recompiles if the contents of any interpolated variable has changed. Demo: `use re 'debug'; foreach (qw( abc abc def )) { print("==============================\n"); print("$_\n"); print("\n"); /$_/; }` [download] Update: Oops, I didn't notice you reversed the loops in addition to moving the regexp.	[reply] [d/l] [select]
Re^2: Slow Regex - How to Optimize by Anonymous Monk on Mar 11, 2016 at 14:14 UTC
great post.. thanks a lot :) it really solved an issue here. reduced the runtime of a usecase from 30 minutes to ~10 seconds. hats off! :)	[reply]
Re: Slow Regex - How to Optimize by GrandFather (Saint) on Aug 30, 2005 at 23:32 UTC
Benchmark results for the curious `Rate Original dave_the_m rnahi borisz ikegami + anon Original 17.7/s -- -97% -99% -99% -100% + -100% dave_the_m 603/s 3313% -- -76% -82% -87% + -87% rnahi 2502/s 14054% 315% -- -26% -46% + -47% borisz 3392/s 19088% 462% 36% -- -27% + -28% ikegami 4642/s 26158% 669% 86% 37% -- + -2% anon 4736/s 26693% 685% 89% 40% 2% + --` [download] Read more... Test code (5 kB) Perl is Huffman encoded by design.	[reply] [d/l] [select]
Re^2: Slow Regex - How to Optimize by ikegami (Patriarch) on Aug 31, 2005 at 04:40 UTC
Two notes: 1) borisz's solution should be even faster on Perl 5.10 with due to a new regexp engine optimization. 2) My solution and anonymonk's assume that %SUBS keys aren't regexps.	[reply]
Re: Slow Regex - How to Optimize by rnahi (Curate) on Aug 30, 2005 at 21:56 UTC
I would use a different approach, i.e. make a single string from your @sub_code and apply the search for each key just once. `#untested my $code = join("", @sub_code); foreach my $sub ( keys %SUBS ) { while ( $code =~ /\b$sub\b\(/g ) { push( @subs, $sub ) ; } }` [download] Notice that your code has a subtle bug. If the same routine is used twice in oone line, you'll get it only once. E.g.: `sqrt(x) + sqrt(y)` BTW, what is that $key in your code?	[reply] [d/l] [select]
Re^2: Slow Regex - How to Optimize by Anonymous Monk on Aug 30, 2005 at 23:16 UTC
Notice that your code has a subtle bug. Even if he is running the code through CPP first (to strip out comments, expand macros, rejoin lines), he still has plenty of corner cases to worry about (strings, fucntion pointers, etc)... `#define p printf int mai\ n/this is a comment: main()/(int argc, char *argv) { int (f)(int, char **) = &main; p("hello world: main()\n"); if(argc>0) f((argc-1),argv); }` [download]	[reply] [d/l]
Re^3: Slow Regex - How to Optimize by Anonymous Monk on Aug 30, 2005 at 23:30 UTC
Oh, and don't forget that `[^a-zA-Z]` matches "(" and that C identifiers can have digits and underscores... `_foo2bar((4),2);` [download]	[reply] [d/l] [select]
Re: Slow Regex - How to Optimize by ikegami (Patriarch) on Aug 30, 2005 at 22:10 UTC
Maybe using a generic regexp, then checking against %SUBS would be better? That eliminates a nested loop and multiple regexp compilations. `foreach my $line ( @sub_code ) { if ( $line =~ /(\w+)\(/ ) { if ( exists $SUBS{$1} ) { push( @subs, $1 ); } } }` [download] Note: I assumed %SUBS keys are subroutine names, not regexps.	[reply] [d/l]
Re: Slow Regex - How to Optimize by noslenj123 (Scribe) on Aug 30, 2005 at 22:50 UTC
As usual, I didn't explain the concept of what I'm trying to accomplish along with the problem. I gather a list of created subroutines by parsing the .h files. That creates %SUBS. For each $sub in %SUBS I need to gather all the subroutines that the $sub calls, but only if is exists in %SUBS. From all your inputs I have learned how to work with the code as a string and apply the regex in a while loop which is so far much faster. I'm trying something like: `while ( $data =~ /\b([a-zA-Z]+)\b\s*\(/g ) { print "$1\n"; }` [download] Thanks for the direction! I'll post results and look for more advice.	[reply] [d/l]
Re: Slow Regex - How to Optimize by borisz (Canon) on Aug 30, 2005 at 22:09 UTC
`my $str = join '\|', sort { length $b <=> length $a } keys %SUBS; my $re = qr/[^a-zA-Z]($str)[^a-zA-Z]*\(/; /$re/ and push @subs, $1 for ( @sub_code );` [download] Boris	[reply] [d/l]
Re: Slow Regex - How to Optimize by InfiniteSilence (Curate) on Aug 30, 2005 at 22:22 UTC
If I am not mistaken, this can be rewritten as: `#!/usr/bin/perl -w use strict; my ($wholefile); my (@keys, %SUBS); $SUBS{'b'} = 1; $SUBS{'c'} = 1; local $/; open(H,qq\|$ARGV[0]\|) or die "USAGE: wierd2.pl <filename>"; #sloppy exa +mple $wholefile .= <H>; close(H); foreach (keys %SUBS) { while($wholefile=~m/\b$_\b/g){ push @keys, $_; } } 1;` [download] Of course, I have no idea how: You are getting the %SUBS hash populated in the first place. I thought that was the purpose of the script? Why you would want to do this Celebrate Intellectual Diversity	[reply] [d/l]
Re: Slow Regex - How to Optimize by noslenj123 (Scribe) on Aug 30, 2005 at 23:05 UTC
Okay guys! You rock! I recoded it and now instead of that process taking 64+ seconds, it now take .064 seconds. And I learned some more perl to boot! :-) Tx all!	[reply]
Re: Slow Regex - How to Optimize by Anonymous Monk on Aug 30, 2005 at 22:24 UTC
I'd try a non-greedy quantifier... `if ( $line =~ /[^a-zA-Z]$sub[^a-zA-Z]?\(/ ) {` [download] ...and maybe a more generic subroutine finder... `foreach my $line ( @sub_code ) { if ($line =~ /[^a-zA-Z]([a-zA-Z_]+[a-zA-Z_0-9])[^a-zA-Z]*\(/ and exists $SUBS{$1}) { push @subs, $key } }` [download]	[reply] [d/l] [select]


Just another Perl shrine
	PerlMonks