alexiskb has asked for the wisdom of the Perl Monks concerning the following question:

I have an ugly list of regexes, shown below:
    sub format {
        $long =~s/( & CO )[^,]+/ AND CO/;
        $long =~s/&/AND/g;
        $long =~s/( EUR1)[^,]+//;
        $long =~s/( EUR2)[^,]+//;
        $long =~s/( EUR3)[^,]+//;
        $long =~s/( EUR4)[^,]+//;
        $long =~s/( EUR8)[^,]+//;
        $long =~s/( EUR0\.)[^,]+//;
        $long =~s/( CHF10)[^,]+//;
        $long =~s/( Y5)[^,]+//;
        $long =~s/( NV )[^,]+//;
        $long =~s/(NON-CUM)[^,]+//;
        $long =~s/( LTD)[^,]+//;
        $long =~s/( FIN )[^,]+//;
        $long =~s/( INTL)[^,]+//;
        $long =~s/( FIN )[^,]+//;
        $long =~s/(\$)[^,]+//;
        $long =~s/\s+$//g;
        # ...etc... for another 40 lines!
    }
How do I loop through an array of these values? Making it work with ( | ) alternation is ugly too. Thank you.

Re: Multiple Regex, it works but it aint clever
by Abigail-II (Bishop) on Aug 06, 2002 at 11:30 UTC
    Here's how I would tackle the problem:
        my @parts = (' EUR[12348]', ' EUR0\.', ' CHF10', ' Y5', ' NV',
                     'NON-CUM', ' LTD', ' FIN ', ' INTL', '\$',
                     # ...
                    );

        sub format {
            local $" = "|";    # join @parts with | when interpolated
            $long =~ s/ & CO [^,]+/ AND CO/;
            $long =~ s/&/AND/g;
            $long =~ s/(?:@parts)[^,]+//g;
            $long =~ s/\s+$//g;
        }

    However, there's a difference. Your code would only remove the first occurrence of ' EUR1', the first of ' EUR2', etc., while my code would remove all of the occurrences. This may or may not be important to you.
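
    To make that difference concrete, here is a minimal sketch. The sub names strip_first and strip_all and the sample string are made up for illustration, and only a few of the patterns are included.

```perl
use strict;
use warnings;

# Illustrative subset of the fragments discussed above.
my @parts = (' EUR[12348]', ' CHF10', ' LTD');

# Join the fragments into one precompiled alternation.
my $alt = join '|', @parts;
my $re  = qr/(?:$alt)[^,]+/;

# Without /g: only the first match is removed.
sub strip_first { my ($s) = @_; $s =~ s/$re//;  return $s }

# With /g: every match is removed.
sub strip_all   { my ($s) = @_; $s =~ s/$re//g; return $s }

# Hypothetical input in which ' EUR1' appears twice.
my $sample = 'ACME EUR1 A,BETA EUR1 B';
print strip_first($sample), "\n";   # prints: ACME,BETA EUR1 B
print strip_all($sample),   "\n";   # prints: ACME,BETA
```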

    Abigail

Re: Multiple Regex, it works but it aint clever
by Basilides (Friar) on Aug 06, 2002 at 12:41 UTC
    I guess you're trying to shorten & standardise equity names, so it doesn't really matter what the issue price was, so you could merge all your EUR lines into one:  s/( EUR\d*)[^,]+//. Same for CHF, Y etc.

    As for the dollar currencies, mind out because at the moment I think you've forgotten to escape a $ sign in $long =~s/\s+$//g;. I think you could catch all of 'em--C$, AU$, plain $, etc--with something like s/([\s+]?\$)[^,]+//g. (This may not be exactly right, but perhaps another helpful monk could put it right if it's not).
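
    The EUR/CHF/Y merge suggested above could look roughly like this. It is only a sketch: the sub name strip_prices and the sample names are invented, and it uses \d+ rather than \d* on the assumption that at least one digit always follows the currency code, so that words such as " EUROPE" or " YORK" are left alone.

```perl
use strict;
use warnings;

# One pattern for " EUR<digits>...", " CHF<digits>..." and " Y<digits>...".
# \d+ (not \d*) is an assumption made here, see the note above.
my $price = qr/ (?:EUR|CHF|Y)\d+[^,]+/;

sub strip_prices {
    my ($s) = @_;
    $s =~ s/$price//g;    # drop everything from the code up to the next comma
    return $s;
}

# Hypothetical equity names, comma separated as in the question.
print strip_prices('ACME EUR0.25 SHS,BETA CHF10 REG,NEW YORK CO'), "\n";
# prints: ACME,BETA,NEW YORK CO
```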

    HTH

      Chances are that $long =~s/\s+$//g; is there to remove trailing white-space. Of course the g modifier is a bit pointless.
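
      A quick way to check this: the $ in \s+$ is the end-of-string anchor, not a variable, and because \s+ is greedy a single substitution already removes the whole trailing run. The sub names rtrim and rtrim_g below are hypothetical.

```perl
use strict;
use warnings;

# Same substitution with and without the /g modifier.
sub rtrim   { my ($s) = @_; $s =~ s/\s+$//;  return $s }
sub rtrim_g { my ($s) = @_; $s =~ s/\s+$//g; return $s }

# Made-up inputs with different kinds of trailing whitespace.
for my $in ("ACME LTD   ", "ACME \t\n", "ACME") {
    my $once   = rtrim($in);
    my $global = rtrim_g($in);
    print "[$once] ", ($once eq $global ? "same" : "different"), "\n";
}
```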

      {NULE}
      --
      http://www.nule.org

      perfect, thank you kindly monks!
Re: Multiple Regex, it works but it aint clever
by {NULE} (Hermit) on Aug 06, 2002 at 12:47 UTC
    Hey alexiskb,

    I may be approaching this more from a mind-set of how to make it fast rather than how to make it elegant, but this is what I would try.

    First I would use the REx quoting operator (qr//) to precompile the RExs outside of the function. I also switched your parens to the non-capturing form, (?:...), since you aren't using $1 in any of your examples. Lastly, this is a case where using study might significantly speed up your program. Here's what I came up with. Update 2: wait - why use parens at all? Silly me...

        #! /usr/local/bin/perl -w
        use strict;

        my $REs = [
            qr/(?: EUR1)[^,]+/,
            qr/(?: EUR2)[^,]+/,
            qr/(?: EUR3)[^,]+/,
            qr/(?: EUR4)[^,]+/,
            qr/(?: EUR8)[^,]+/,
            qr/(?: EUR0\.)[^,]+/,
            qr/(?: CHF10)[^,]+/,
            qr/(?: Y5)[^,]+/,
            qr/(?: NV )[^,]+/,
            qr/(?:NON-CUM)[^,]+/,
            qr/(?: LTD)[^,]+/,
            qr/(?: FIN )[^,]+/,
            qr/(?: INTL)[^,]+/,
            qr/(?:\$)[^,]+/,
            qr/(?:\s+$)/
        ];

        my $string = "a INTL , b Y5c, NV , d e f... & & CO FIN ";
        my $return = &format($REs, $string);
        print ">$string<\n";
        print ">$return<\n";
        exit;

        sub format {
            my $REs = shift;
            my $string = shift;
            study $string;
            $string =~ s/(?: & CO )[^,]+/ AND CO/;
            $string =~ s/&/AND/g;
            for (@{$REs}) {
                $string =~ s/$_//;
            }
            return $string;
        }

    Without a sample of your data and output it is hard to be sure that this does what you want. It should, however, be faster, and the format subroutine is a little cleaner to look at.

    Hope this helps you. If I have time I may run this through some benchmarks and see if it does in fact speed things up, and see which part helps the most.
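
    A benchmark along those lines could be sketched with the core Benchmark module. Everything here is an assumption for illustration: the fragment subset, the sample string, and the sub names loop_version and single_version. Note that the two variants are only equivalent when no fragment occurs twice in the input, and the timings will depend heavily on real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Illustrative subset of the fragments from the question.
my @frags = (' EUR1', ' EUR2', ' CHF10', ' Y5', ' LTD', ' INTL');

# Variant 1: one precompiled regex per fragment, applied in a loop.
my @res = map { qr/(?:$_)[^,]+/ } @frags;

sub loop_version {
    my ($s) = @_;
    $s =~ s/$_// for @res;
    return $s;
}

# Variant 2: a single alternation built from the same fragments.
my $alt = join '|', @frags;
my $one = qr/(?:$alt)[^,]+/;

sub single_version {
    my ($s) = @_;
    $s =~ s/$one//g;
    return $s;
}

# Hypothetical input; real data may behave quite differently.
my $sample = 'ACME EUR1 ORD,BETA LTD SHS,GAMMA INTL REG';

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    loop   => sub { loop_version($sample) },
    single => sub { single_version($sample) },
});
```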

    Update: Oops - typos.

    Good luck,
    {NULE}
    --
    http://www.nule.org

Re: Multiple Regex, it works but it aint clever
by RMGir (Prior) on Aug 06, 2002 at 13:44 UTC
    You might want to look into Regex::PreSuf, which is designed to build an optimal regex to match many strings like yours.

    (From the perldoc:)

        use Regex::PreSuf;
        my $re = presuf(qw(foobar fooxar foozap));
        # $re should be now 'foo(?:zap|[bx]ar)'

    You could then do an s/$re//og, once you've built $re for your set of strings.
    --
    Mike
Re: Multiple Regex, it works but it aint clever
by fruiture (Curate) on Aug 06, 2002 at 14:18 UTC

    Use a hash? And what do you need the parentheses for?

        sub format {
            my %replace = (
                ' & CO [^,]+' => ' AND CO',
                '&' => 'AND',
                #followed by [^,]+
                map( {( quotemeta($_).'[^,]+' , '' )}
                    #leading space
                    map(" $_",
                        qw/EUR1 EUR2 EUR4 EUR8 EUR0. CHF Y5 LTD INTL/,
                        'NV ','FIN '
                    ),
                    #no leading space
                    '$','NON-CUM'
                ),
                '\s+$' => '',
                #others...
            );
            # search longest first, so ' & CO ' is handled before the bare '&'
            # (a plain reverse sort would try '&' first)
            for(sort { length($b) <=> length($a) } keys %replace){
                $long =~ s/$_/$replace{$_}/g; #/g seems useful
            }
        }

    edit: missing replacement for '\s+$' added
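
    A hash carries no inherent order, so the order in which the rules run has to come from a sort. One alternative, sketched here under assumptions, is to keep the rules in an array of pattern/replacement pairs so the order is explicit. The sub name clean_name, the sample string, and the choice of rules are illustrative only.

```perl
use strict;
use warnings;

# Rules are applied top to bottom, so ' & CO ' is handled before the bare '&'.
my @rules = (
    [ qr/ & CO [^,]+/ => ' AND CO' ],
    [ qr/&/           => 'AND'     ],
    # Literal fragments followed by everything up to the next comma.
    ( map { [ qr/(?:\Q$_\E)[^,]+/ => '' ] }
        ' EUR1', ' CHF10', ' LTD', '$', 'NON-CUM' ),
    [ qr/\s+$/        => ''        ],
);

sub clean_name {
    my ($s) = @_;
    for my $rule (@rules) {
        my ($re, $to) = @$rule;
        $s =~ s/$re/$to/g;
    }
    return $s;
}

# Hypothetical input, comma separated as in the question.
print clean_name('SMITH & CO HOLDINGS,A & B LTD ORD  '), "\n";
# prints: SMITH AND CO,A AND B
```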

    --
    http://fruiture.de