Multiple Regex, it works but it aint clever

alexiskb has asked for the wisdom of the Perl Monks concerning the following question:

Re: Multiple Regex, it works but it aint clever by Abigail-II (Bishop) on Aug 06, 2002 at 11:30 UTC
Here's how I would tackle the problem:

    my @parts = (' EUR[12348]', ' EUR0\.', ' CHF10', ' Y5', ' NV', 'NON-CUM',
                 ' LTD', ' FIN ', ' INTL', '\$', ...);

    sub format {
        local $" = "|";
        $long =~ s/ & CO [^,]+/ AND CO/;
        $long =~ s/&/AND/g;
        $long =~ s/(?:@parts)[^,]+//g;
        $long =~ s/\s+$//g;
    }

However, there's a difference. Your code would only remove the first occurrence of ' EUR1', the first of ' EUR2', etc., while my code would remove all of the occurrences. This may or may not be important to you.

Abigail
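A minimal sketch of the first-versus-all difference described above. The sample string and the helper names `strip_first` and `strip_all` are made up for illustration, since the original data is not shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical sample: two issues that both carry a " EUR1..." tail.
my $sample = 'ACME EUR1.00 ORD,BETA EUR1.50 PREF';

# Without /g, only the first match is removed.
sub strip_first {
    my ($s) = @_;
    $s =~ s/ EUR1[^,]+//;
    return $s;
}

# An alternation built by interpolating a list with $" set to "|",
# applied with /g so that every occurrence is removed.
sub strip_all {
    my ($s) = @_;
    my @parts = (' EUR[12348]', ' CHF10');
    local $" = '|';
    $s =~ s/(?:@parts)[^,]+//g;
    return $s;
}

print strip_first($sample), "\n";   # prints ACME,BETA EUR1.50 PREF
print strip_all($sample), "\n";     # prints ACME,BETA
```

Whether the second behaviour is the wanted one depends on whether a single name can contain the same marker twice.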
Re: Multiple Regex, it works but it aint clever by Basilides (Friar) on Aug 06, 2002 at 12:41 UTC
I guess you're trying to shorten and standardise equity names, so it doesn't really matter what the issue price was, and you could merge all your EUR lines into one: `s/( EUR\d*)[^,]+//`. Same for CHF, Y, etc. As for the dollar currencies, mind out, because at the moment I think you've forgotten to escape a $ sign in `$long =~s/\s+$//g;`. I think you could catch all of 'em--C$, AU$, plain $, etc.--with something like `s/([\s+]?\$)[^,]+//g`. (This may not be exactly right, but perhaps another helpful monk could put it right if it's not.)

HTH
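One possible way to write the merged patterns, offered as a sketch rather than a tested fix. The helper name `strip_currency`, the sample string, and the assumption that dollar prefixes are limited to `AU` and `C` are all illustrative; real data may need a wider list.

```perl
use strict;
use warnings;

# Hypothetical helper: drop everything from a currency marker up to the next comma.
sub strip_currency {
    my ($s) = @_;
    # " EUR", " CHF" or " Y" followed by at least one digit, whatever the amount is.
    $s =~ s/ (?:EUR|CHF|Y)\d[^,]*//g;
    # Optional whitespace, an optional AU or C prefix, then a literal dollar sign.
    $s =~ s/\s*(?:AU|C)?\$[^,]*//g;
    return $s;
}

print strip_currency('FOO EUR0.25 ORD,BAR C$1 PREF,BAZ $5,QUX AU$2'), "\n";
# prints FOO,BAR,BAZ,QUX
```

Requiring `\d` rather than `\d*` after the currency code is a deliberate choice here: with `\d*`, a name such as " EUROPA" or " YORK" would also be cut.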
Re: Re: Multiple Regex, it works but it aint clever by {NULE} (Hermit) on Aug 06, 2002 at 12:51 UTC
Chances are that `$long =~s/\s+$//g;` is there to remove trailing whitespace: the unescaped `$` is the end-of-string anchor. Of course the `g` modifier is a bit pointless, since a pattern anchored at the end of the string can match only once.

{NULE} -- http://www.nule.org
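A quick check of both points, with made-up helper names `rtrim` and `rtrim_g`: the unescaped `$` before the closing delimiter acts as the end anchor, and adding `/g` does not change the result.

```perl
use strict;
use warnings;

# Trailing-whitespace removal without /g ...
sub rtrim   { my ($s) = @_; $s =~ s/\s+$//;  return $s }

# ... and with /g, which is harmless but adds nothing for an end-anchored pattern.
sub rtrim_g { my ($s) = @_; $s =~ s/\s+$//g; return $s }

my $name = "ACME LTD   ";
print '>', rtrim($name),   "<\n";   # prints >ACME LTD<
print '>', rtrim_g($name), "<\n";   # prints >ACME LTD<
```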
Re: Re: Multiple Regex, it works but it aint clever by alexiskb (Acolyte) on Aug 06, 2002 at 13:37 UTC
perfect, thank you kindly monks!
Re: Multiple Regex, it works but it aint clever by {NULE} (Hermit) on Aug 06, 2002 at 12:47 UTC
Hey alexiskb,

I may be approaching this more from a mindset of how to make it fast rather than how to make it elegant, but this is what I would try. First, I would use the quoting REx operator (`qr//`) to precompile the RExs outside of the function. I also switched your parens to the non-capturing form, `(?:...)`, since you aren't using $1 in any of your examples. Lastly, this is a case where using `study` might significantly speed up your program. Here's what I came up with.

Update 2: wait - why use parens at all? Silly me...

    #! /usr/local/bin/perl -w
    use strict;

    # Patterns compiled once, outside the subroutine.
    my $REs = [
        qr/(?: EUR1)[^,]+/,
        qr/(?: EUR2)[^,]+/,
        qr/(?: EUR3)[^,]+/,
        qr/(?: EUR4)[^,]+/,
        qr/(?: EUR8)[^,]+/,
        qr/(?: EUR0\.)[^,]+/,
        qr/(?: CHF10)[^,]+/,
        qr/(?: Y5)[^,]+/,
        qr/(?: NV )[^,]+/,
        qr/(?:NON-CUM)[^,]+/,
        qr/(?: LTD)[^,]+/,
        qr/(?: FIN )[^,]+/,
        qr/(?: INTL)[^,]+/,
        qr/(?:\$)[^,]+/,
        qr/(?:\s+$)/
    ];

    my $string = "a INTL , b Y5c, NV , d e f... & & CO FIN ";
    my $return = &format($REs, $string);
    print ">$string<\n";
    print ">$return<\n";
    exit;

    sub format {
        my $REs    = shift;
        my $string = shift;
        study $string;
        $string =~ s/(?: & CO )[^,]+/ AND CO/;
        $string =~ s/&/AND/g;
        for (@{$REs}) {
            $string =~ s/$_//;
        }
        return $string;
    }

Without a sample of your data and output it is hard to be sure that this does what you want. It should, however, be faster, and the `format` subroutine is a little cleaner to look at.

Hope this helps you. If I have time I may run this through some benchmarks to see whether it does in fact speed things up, and which part helps the most.

Update: Oops - typos.

Good luck,
{NULE} -- http://www.nule.org
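A rough sketch of how such a benchmark could be set up with the core `Benchmark` module. The shortened pattern list, the sample string, and the names `strip_loop` and `strip_combined` are assumptions for illustration; no timings are claimed here, and results will vary with the data and the Perl version. Note also that `study` has been a no-op since Perl 5.16, so it is left out.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# A shortened, hypothetical subset of the precompiled patterns.
my @res = (qr/ EUR1[^,]+/, qr/ CHF10[^,]+/, qr/ LTD[^,]+/, qr/\s+$/);

# The same patterns folded into a single alternation, for comparison.
my $combined = qr/(?: EUR1| CHF10| LTD)[^,]+|\s+$/;

# One substitution per precompiled pattern (first occurrence only).
sub strip_loop {
    my ($s) = @_;
    $s =~ s/$_// for @res;
    return $s;
}

# One global substitution with the combined pattern (all occurrences).
sub strip_combined {
    my ($s) = @_;
    $s =~ s/$combined//g;
    return $s;
}

my $sample = 'ACME EUR1.00 ORD,BETA LTD PREF  ';

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    loop     => sub { strip_loop($sample) },
    combined => sub { strip_combined($sample) },
});
```

The two variants agree on this sample, but they are not equivalent in general: the loop removes only the first match of each pattern, while the combined version removes every match.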
Re: Multiple Regex, it works but it aint clever by RMGir (Prior) on Aug 06, 2002 at 13:44 UTC
You might want to look into Regex::PreSuf, which is designed to build an optimal regex to match many strings like yours. (From the perldoc:)

    use Regex::PreSuf;
    my $re = presuf(qw(foobar fooxar foozap));
    # $re should be now 'foo(?:zap|[bx]ar)'

You could then do an `s/$re//og`, once you've built $re for your set of strings.

-- Mike
Re: Multiple Regex, it works but it aint clever by fruiture (Curate) on Aug 06, 2002 at 14:18 UTC
Use a hash!? And what do you need the parentheses for?

    sub format {
        my %replace = (
            ' & CO [^,]+' => ' AND CO',
            '&'           => 'AND',
            #followed by [^,]+
            map( {( quotemeta($_).'[^,]+' , '' )}
                #leading space
                map(" $_",
                    qw/EUR1 EUR2 EUR4 EUR8 EUR0. CHF Y5 LTD INTL/,
                    'NV ', 'FIN '
                ),
                #no leading space
                '$', 'NON-CUM'
            ),
            '\s+$' => '',
            #others...
        );
        # search longest first
        for (reverse sort keys %replace) {
            $long =~ s/$_/$replace{$_}/g; #/g seems useful
        }
    }

edit: missing replacement for '\s+$' added

-- http://fruiture.de
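One caveat with a hash is ordering: `reverse sort keys` gives reverse string order, not strictly longest-first, and in that order the plain `'&'` key sorts ahead of `' & CO [^,]+'`, so the " & CO" rule would never see an ampersand. A sketch of an alternative that keeps the sequence explicit with an array of pairs follows. The rule list is a shortened, hypothetical subset, and `apply_rules` is an illustrative name.

```perl
use strict;
use warnings;

# Ordered [pattern, replacement] pairs: an array keeps the sequence explicit,
# so the " & CO ..." rule runs before the plain "&" rule.
my @rules = (
    [ qr/ & CO [^,]+/ => ' AND CO' ],
    [ qr/&/           => 'AND'     ],
    # Literal markers, each followed by everything up to the next comma.
    ( map { [ qr/\Q$_\E[^,]+/ => '' ] } ' EUR1', ' EUR0.', ' LTD', '$', 'NON-CUM' ),
    [ qr/\s+$/        => ''        ],
);

# Apply every rule in order, globally.
sub apply_rules {
    my ($s) = @_;
    for my $rule (@rules) {
        my ($re, $to) = @$rule;
        $s =~ s/$re/$to/g;
    }
    return $s;
}

print apply_rules('SMITH & CO HOLDINGS,A & B LTD ORD  '), "\n";
# prints SMITH AND CO,A AND B
```

`\Q...\E` does the same job as `quotemeta` in the hash version, so the literal `$` and `.` in the markers are matched as plain characters.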