Organising lots of simple regexes

ViceRaid has asked for the wisdom of the Perl Monks concerning the following question:

Hullo

I have a script that's nothing but a lot of simple =~ s///s. It's a redirect script for Squid; customers point their domain to our machine, and that box then forwards that request on to whatever one of our servers should handle it. So, the script looks something like:


while ( <> ) {
...
elsif ( s|http://www.theirsite.com/\W|http://our.server1/theirsite/\n|i ) { }
elsif ( s|http://www.theirsite.com/|http://our.server1/theirsite/|i ) { }
elsif ( s|http://www.dummy.com/\W|http://our.server2/dummy/\n|i ) { }

# .. ad nauseam
}

A couple of questions. I'd love some ideas for how to make this work more straightforwardly, especially defining the rules more clearly than a long list of regexes. Maybe it could use a text config file which expresses the simple, similar regexes that should be compiled at start up?

Secondly, since only one rule gets applied to each incoming URL, the most frequently used rules (which we can test against the logs) should go near the top, and the others near the bottom. However, it's a royal PITA to test and develop this. Any ideas on how to benchmark this painlessly, or a better algorithm - perhaps something B-Tree-ish - to order the rules?

thanks
ViceRaid

Update: Clarified as per diotalevi's nit
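
On the benchmarking question, one low-effort option is the core Benchmark module. The following is only a sketch under stated assumptions: the fifty `www.siteN.com` hosts, the `by_chain` and `by_hash` subs and the `%target_for` table are all made up for illustration, and real timings should be taken against URLs replayed from the Squid logs.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample URLs; real input would be replayed from the logs.
my @urls = map { "http://www.site$_.com/index.html" } 1 .. 50;

# Strategy 1: an ordered chain of precompiled substitutions.
my @rules = map { [ qr{^http://www\.site$_\.com/}i, "http://our.server1/site$_/" ] } 1 .. 50;

sub by_chain {
    my ($url) = @_;
    for my $rule (@rules) {
        my ( $re, $to ) = @$rule;
        return $url if $url =~ s/$re/$to/;    # first matching rule wins
    }
    return $url;
}

# Strategy 2: extract the host once and look it up in a hash.
my %target_for = map { ( "www.site$_.com" => "our.server1/site$_/" ) } 1 .. 50;

sub by_hash {
    my ($url) = @_;
    if ( $url =~ m{^http://([^/]+)/}i ) {
        my $target = $target_for{ lc $1 };
        # $+[0] is the offset just past the matched "http://host/" prefix.
        substr( $url, 0, $+[0], "http://$target" ) if defined $target;
    }
    return $url;
}

# Run each strategy for at least one CPU second and print a comparison table.
cmpthese( -1, {
    chain => sub { by_chain($_) for @urls },
    hash  => sub { by_hash($_) for @urls },
} );
```

The chain's cost depends on where the matching rule sits in the list, whereas the hash lookup does not, so the ordering problem only exists for the first strategy.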


Replies are listed 'Best First'.
Re: Organising lots of simple regexes by hardburn (Abbot) on Jan 27, 2004 at 14:23 UTC
Put your rules in a hash that maps each incoming host to an arrayref, with the target host as the first element and the target path as the second:

    my %REDIRECTS = (
        'www.theirsite.com' => [qw( our.server1 theirsite/ )],
        'www.dummy.com'     => [qw( our.server2 dummy/ )],
        # And so on
    );

Use URI.pm to parse your input and get the host:

    use URI;
    my $uri  = URI->new($_);
    my $host = lc $uri->host;   # Need to make sure the case is correct

And then simply pull the redirect out of the hash and do what your load of `s///` substitutions are doing now:

    # Rebuild the target URL; the scheme is not stored in the hash
    $_ = 'http://' . join '/', @{ $REDIRECTS{$host} };

If you want to have some action happen for each host, you can put a subroutine ref as the third element of the arrayref (you'll need to add an array slice to the last line of code above).

----
I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident. -- Schemer

`: () { :|:& };:`

Note: All code is untested, unless otherwise stated
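
A minimal sketch of the sub-ref variant mentioned at the end of the post above, with the array slice in place. To stay dependency-free it pulls the host out with a plain regex rather than URI.pm, and the `redirect` sub, the `%hits` counter and the per-host action are hypothetical names added for illustration only.

```perl
use strict;
use warnings;

my %hits;    # hypothetical counter, updated by the per-host action below

# host => [ target host, target path, optional action (code ref) ]
my %REDIRECTS = (
    'www.theirsite.com' => [ 'our.server1', 'theirsite/' ],
    'www.dummy.com'     => [ 'our.server2', 'dummy/', sub { $hits{'www.dummy.com'}++ } ],
);

sub redirect {
    my ($url) = @_;

    # Assumption: a simple regex stands in for URI.pm here.
    my ( $host, $rest ) = $url =~ m{^http://([^/]+)/(.*)}s
        or return $url;

    my $rule = $REDIRECTS{ lc $host }
        or return $url;    # unknown host: pass through unchanged

    my $action = $rule->[2];
    $action->($url) if $action;

    # The slice keeps the code ref out of the join.
    return 'http://' . join( '/', @{$rule}[ 0, 1 ] ) . $rest;
}

print redirect("http://WWW.theirsite.com/a/b"), "\n";
```

Unlike the one-liner in the post, this sketch keeps whatever followed the host in the original URL; drop `$rest` if every request should land on the root of the redirect.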
Re: Re: Organising lots of simple regexes by ViceRaid (Chaplain) on Jan 27, 2004 at 16:05 UTC
Thanks all for your ideas. As an alternative way of writing this, so I don't have to have lots of subroutine refs in a big hash structure, I'm now thinking of doing something like this:

    use strict;

    package Our::Redirects;

    sub www_theirsite_com {
        my $url = shift;
        # and transform url here however
    }

    # allow for alt domain names
    *www_theirsite_co_uk = \&www_theirsite_com;

    package main;

    use URI;

    while ( <> ) {
        my $uri = URI->new($_) or die "Can't parse URI";
        my $func = lc( $uri->host() );
        $func =~ tr/.-/_/;
        if ( defined &{ "Our::Redirects::$func" } ) {
            &{ \&{ "Our::Redirects::$func" } }($_);
        }
    }

This seems to me to have the advantages of a hash-type construct - i.e. straight-through mapping - but with slightly sugary syntax, particularly for some of the cases which aren't quite as simple as the ones I suggested. Does this seem like a reasonable way to go? Thanks again
Re: Re: Re: Organising lots of simple regexes by hardburn (Abbot) on Jan 27, 2004 at 16:14 UTC
That's reasonable. I would change the code to call the subroutine as `Our::Redirects->$func` (the subroutines would have to be changed to something like `sub www_theirsite_com { my ($class, $url) = @_; . . . }`). Also, I would pass the URI object instead of its string form. Lastly, wrap the call in an `eval` to get rid of the conditional (such as `eval { Our::Redirects->$func($uri) };`).

----
I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident. -- Schemer

`: () { :|:& };:`

Note: All code is untested, unless otherwise stated
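
Putting the two posts together, the method-call form with `eval` might look roughly like the sketch below. Assumptions: the host is extracted with a plain regex and the URL is passed as a string (not a URI object) so the example runs without URI.pm; the `dispatch` sub and the body of the handler are invented for illustration; and the `\w+` check on the method name is an extra precaution that is not part of the original suggestion.

```perl
use strict;
use warnings;

package Our::Redirects;

sub www_theirsite_com {
    my ( $class, $url ) = @_;
    $url =~ s{^http://[^/]+/}{http://our.server1/theirsite/};
    return $url;
}

# Alternative domain names share one handler.
{
    no warnings 'once';
    *www_theirsite_co_uk = \&www_theirsite_com;
}

package main;

sub dispatch {
    my ($url) = @_;
    my ($host) = $url =~ m{^http://([^/]+)/}
        or return $url;

    # www.theirsite.co.uk -> www_theirsite_co_uk
    ( my $func = lc $host ) =~ tr/.-/_/;

    # Only plain identifiers may be used as method names.
    return $url unless $func =~ /\A\w+\z/;

    # eval traps the "can't locate object method" error for unknown hosts.
    my $new = eval { Our::Redirects->$func($url) };
    return defined $new ? $new : $url;
}

print dispatch("http://www.theirsite.co.uk/page"), "\n";
```

One trade-off to be aware of: the `eval` also swallows any error raised inside a handler, so checking `Our::Redirects->can($func)` first may be preferable when handlers are non-trivial.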
Re: Organising lots of simple regexes by Abigail-II (Bishop) on Jan 27, 2004 at 13:53 UTC
I wouldn't spend time dynamically shuffling the order. Just log the "hits", and generate a new program every hour/week/month/whatever. Write the script like this:

    while (<>) {
        s!http://www.theirsite.com/\W!http://our.server1/theirsite/\n!i ||
        s!http://www.theirsite.com/!http://our.server1/theirsite/!i ||
        s!http://www.dummy.com/\W!http://our.server2/dummy/\n!i ||
        ...
        logme ($&)
    }

with `logme` an efficient log function. (And for those who want to complain about `$&`, think before you post).

Abigail
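
The "generate a new program" step could be sketched along these lines. Everything here is an assumption made for illustration: the `%rules` table, the `%hits` counts (which would really be tallied from whatever `logme` wrote), and the `generate_chain` helper. It also assumes the replacement strings contain no `!`, `$` or `@`.

```perl
use strict;
use warnings;

# Hypothetical rule table: source prefix => replacement prefix.
my %rules = (
    'http://www.theirsite.com/' => 'http://our.server1/theirsite/',
    'http://www.dummy.com/'     => 'http://our.server2/dummy/',
);

# Hypothetical hit counts, e.g. tallied from the log.
my %hits = (
    'http://www.dummy.com/'     => 120,
    'http://www.theirsite.com/' => 45,
);

# Emit the substitution chain as Perl source, most frequently hit rule first.
sub generate_chain {
    my ( $rules, $hits ) = @_;
    my @ordered = sort {
        ( $hits->{$b} || 0 ) <=> ( $hits->{$a} || 0 ) or $a cmp $b
    } keys %$rules;
    my @subs = map { sprintf 's!%s!%s!i', quotemeta($_), $rules->{$_} } @ordered;
    return join " ||\n", @subs;
}

print generate_chain( \%rules, \%hits ), "\n";
```

The generated text would then be written into the body of the `while (<>)` loop of the new script; `quotemeta` escapes the dots so that they only match literal dots.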
Re: Organising lots of simple regexes by BrowserUk (Patriarch) on Jan 27, 2004 at 16:27 UTC
This begs for the use of an 'ini' file and to exploit the lookup abilities of a hash. As it appears that the URL is embedded in other stuff, and you appear to want to route pathed URLs to the root of the redirects (is that what the \W => \n stuff is doing?), you can't use a straight lookup in a hash. You first need to separate out the domain name for the lookup, and substitute that back into the original string.

    #! perl -slw
    use strict;
    use Data::Dumper;

    my %map = map{ split /=/, $_, 2 } split /\n/, do{ local $/; <DATA> };
    print Dumper \%map;

    while( <> ) {
        chomp;
        if( m[http://([^/]+)/(\W?)]i ) {
            next unless defined $map{ $1 };
            substr( $_, $-[0], $+[0] - $-[0] ) = 'http://' . $map{ $1 } . ( length $2 ? $2 : "\n$2" );
        }
        print;
    }
    __DATA__
    www.theirsite.com=our.server1/theirsite/
    www.dummy.com=our.server2/dummy/

Sample run:

    P:\test>type junk.txt
    http://www.theirsite.com/
    http://www.dummy.com/
    http://www.theirsite.com/stuff
    http://www.dummy.com/stuff
    http://www.theirsite.com/ stuff
    http://www.dummy.com/ stuff

    P:\test>perl 324406-2.pl < junk.txt
    $VAR1 = {
              'www.theirsite.com' => 'our.server1/theirsite/',
              'www.dummy.com' => 'our.server2/dummy/'
            };

    http://our.server1/theirsite/

    http://our.server2/dummy/

    http://our.server1/theirsite/
    stuff
    http://our.server2/dummy/
    stuff
    http://our.server1/theirsite/ stuff
    http://our.server2/dummy/ stuff

You could probably use a better regex for getting the domain and one of the Config::* modules for parsing the ini file.

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Timing (and a little luck) are everything!
Re: Organising lots of simple regexes by ambrus (Abbot) on Jan 27, 2004 at 14:30 UTC
It might look better if you change the elsif's to or's: `s|this|that| or s|foo|bar|` (But don't do this: `s|this|that|||s|foo|bar|`)
Re: Organising lots of simple regexes by diotalevi (Canon) on Jan 27, 2004 at 15:56 UTC
This is just a nit, but all of those expressions are already compiled at startup, before any of them is used.
Re: Re: Organising lots of simple regexes by ViceRaid (Chaplain) on Jan 27, 2004 at 16:11 UTC
It's a fair nit, guv'nor. What I meant was regexes compiled from something other than Perl regex literals, such as a text config file or XML. I'll update the root node to make it clearer. Cheers.
Re: Re: Re: Organising lots of simple regexes by diotalevi (Canon) on Jan 27, 2004 at 17:18 UTC
See "/o is dead, long live qr//!" for more than you ever wanted to know about when regular expressions are compiled and how you can control that.
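
For the config-file idea clarified above, a rough sketch of building `qr//` objects once at startup. The config format (`host = target`), the inline `$config` text and the `rewrite` sub are assumptions for illustration; a real script would read the rules from a file.

```perl
use strict;
use warnings;

# Hypothetical config text; in practice this would be read from a file at startup.
my $config = <<'END';
# source host        = replacement prefix
www.theirsite.com    = our.server1/theirsite/
www.dummy.com        = our.server2/dummy/
END

# Compile each rule exactly once, up front.
my @rules;
for my $line ( split /\n/, $config ) {
    next if $line =~ /^\s*#/ or $line !~ /\S/;    # skip comments and blank lines
    my ( $host, $target ) = $line =~ /^\s*(\S+)\s*=\s*(\S+)\s*$/
        or die "Bad config line: $line\n";

    # \Q...\E makes the dots in the host name literal.
    push @rules, [ qr{^http://\Q$host\E/}i, "http://$target" ];
}

sub rewrite {
    my ($url) = @_;
    for my $rule (@rules) {
        my ( $re, $to ) = @$rule;
        return $url if $url =~ s/$re/$to/;    # first matching rule wins
    }
    return $url;
}

print rewrite("http://www.dummy.com/index.html"), "\n";
```

Because each pattern is already a compiled `qr//` object, using it inside `s///` in the loop does not trigger recompilation per request.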
Re: Organising lots of simple regexes by ysth (Canon) on Jan 27, 2004 at 18:53 UTC
I think you want `\B`, not `\W`. Can't see why you'd want to actually remove a character as `\W` does. (`\B` will be equivalent to `(?!\w)` there).
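
A small sketch of the difference, using hypothetical helper subs `with_W` and `with_B` (the dots are escaped here, unlike in the original rules):

```perl
use strict;
use warnings;

# \W must consume one non-word character after the slash.
sub with_W {
    my ($s) = @_;
    $s =~ s{http://www\.theirsite\.com/\W}{http://our.server1/theirsite/\n}i;
    return $s;
}

# \B only asserts that no word character follows the slash; nothing is consumed.
sub with_B {
    my ($s) = @_;
    $s =~ s{http://www\.theirsite\.com/\B}{http://our.server1/theirsite/\n}i;
    return $s;
}

for my $in ( "http://www.theirsite.com/ stuff", "http://www.theirsite.com/" ) {
    for my $pair ( [ 'W', with_W($in) ], [ 'B', with_B($in) ] ) {
        my ( $label, $out ) = @$pair;
        $out =~ s/\n/\\n/g;    # make the inserted newline visible
        print "\\$label: $out\n";
    }
}
```

With `\W` the space before "stuff" is swallowed, and a URL that ends right after the slash does not match at all because there is nothing left to consume; with `\B` the space survives and the bare URL is rewritten.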