ketema has asked for the wisdom of the Perl Monks concerning the following question:

Hey monks, I have the following text:
Frank Doe - Toronto SASS,Ajax; Aurora; Brampton; Greenwood; Kilbride; Maple; Markham; Mississauga; Oshawa; Pickering; Richmond Hill; Thornhill; Toronto; Whitby; Woodbridge,4/23/2009,TASS,0
all on one line. So far I have the following regex:
^(.*,)(.*;\s)?(.*,)(.*,)(.*,)(\d)$
The second capture group can vary, meaning the line of text could be:
Frank Doe - Toronto SASS,Ajax; Aurora; Woodbridge,4/23/2009,TASS,0
or even
Frank Doe - Toronto SASS,Ajax; Greenwood,4/23/2009,TASS,0
There will always be at least one match on the second capture group, but I am interested in getting that group broken up, so that for each occurrence of the pattern .*; I get a separate group that I can reference later. How would I go about doing this? Thanks

Re: Multiple Capture Groups in RegEx
by shmem (Chancellor) on Apr 24, 2009 at 00:04 UTC

    First, .* is almost always wrong. It is greedy, and it matches nothing as well as anything.
    Second, just do it. Get the second group, then break it up. You will have to deal with the elements anyways. It can be seen as a sport to process all with just one regex, but that's often error prone, unreadable, unmaintainable and slower.
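The "get the second group, then break it up" advice can be sketched as below. This is a minimal illustration, not the poster's code: the helper name parse_line and the field names (name, date, code, flag) are made up for the example, and the pattern assumes exactly five comma-separated fields ending in a single digit.

```perl
use strict;
use warnings;

# Sample line from the question (shortened town list).
my $line = 'Frank Doe - Toronto SASS,Ajax; Aurora; Woodbridge,4/23/2009,TASS,0';

# Hypothetical helper: capture the five comma-separated fields with one
# regex, then split the second field on semicolons in a separate step.
sub parse_line {
    my ($text) = @_;
    my ($name, $towns, $date, $code, $flag) =
        $text =~ /^([^,]*),([^,]*),([^,]*),([^,]*),(\d)$/
        or return;                      # no match: return nothing
    my @towns = split /;\s*/, $towns;   # break up the semicolon list
    return { name => $name, towns => \@towns, date => $date,
             code => $code, flag => $flag };
}

my $rec = parse_line($line);
print join('|', @{ $rec->{towns} }), "\n";   # prints Ajax|Aurora|Woodbridge
```

Each town is then available as $rec->{towns}[0], $rec->{towns}[1], and so on, however many there are.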

      That makes sense. I can easily split on the 2nd capture and then process it; I guess I was just locked into thinking I could get them all back separately from one regex. As for the .*, it works; I could use .+ and it works too.
        As for the .*, it works; I could use .+ and it works too.

        See perlre and look for greediness too, e.g. m/.+?/
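To make the greediness point concrete, here is a small sketch comparing .* with .*? on one of the sample lines. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $line = 'Frank Doe - Toronto SASS,Ajax; Greenwood,4/23/2009,TASS,0';

# Greedy: .* runs to the end of the line, then backs off to the LAST comma.
my ($greedy) = $line =~ /^(.*),/;

# Non-greedy: .*? stops at the FIRST comma it reaches.
my ($lazy) = $line =~ /^(.*?),/;

print "greedy: $greedy\n";  # greedy: Frank Doe - Toronto SASS,Ajax; Greenwood,4/23/2009,TASS
print "lazy:   $lazy\n";    # lazy:   Frank Doe - Toronto SASS
```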

        It works ... until you have four comma-delimited fields instead of three before the final digit field. .* is best used when you care about the beginning and the end of a string but not the stuff in between. A better way is ([^,]*,) for the comma-delimited fields you care about, plus a regex that carefully matches the pattern of the multiple semicolon-delimited fields, such as ((?:[^;,]*;\s+)*):

        ^([^,]*,)((?:[^;,]*;\s+)*)([^,]*,)([^,]*,)([^,]*,)(?:.*,)?(\d)$

        Note the use of (?:.*,)? to skip past any extra fields before the final digit field; it is optional, so lines without extra fields still match. Using [^,]* instead of .* means you only grab up to (and not including) the next comma and no more. Likewise, [^;,]* lets a town name contain spaces (e.g. Richmond Hill). The last town is followed by a comma rather than a semicolon, so it lands in the third group.

        Best, beth
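A quick way to check a pattern of this shape against the sample data is sketched below. It rests on two assumptions that are not in the original question: the skip over extra fields is optional, written (?:.*,)?, and town names may contain spaces, hence [^;,]* rather than \w*. The variable names are illustrative.

```perl
use strict;
use warnings;

# Assumed variant of the pattern discussed above (see the two assumptions
# in the lead-in): negated character classes instead of .* for each field.
my $re = qr/^([^,]*,)((?:[^;,]*;\s+)*)([^,]*,)([^,]*,)([^,]*,)(?:.*,)?(\d)$/;

my $line = 'Frank Doe - Toronto SASS,Ajax; Richmond Hill; Woodbridge,4/23/2009,TASS,0';

my @fields = $line =~ $re or die "no match";

# Group 2 holds every town but the last; group 3 holds the last one
# together with its trailing comma.
my @towns = split /;\s*/, $fields[1] . $fields[2];
$towns[-1] =~ s/,$//;    # drop the comma captured with group 3

print "$_\n" for @towns;
```

Even with a tighter pattern, the towns still have to be separated in a second step, because a repeated group inside a single match keeps only its last repetition.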

Re: Multiple Capture Groups in RegEx
by graff (Chancellor) on Apr 24, 2009 at 03:40 UTC
    How about something like this:
    my @comma_groups = split /,/;
    my @scolon_groups = split /;\s*/, $comma_groups[1];
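For completeness, here is the split-only approach as a runnable sketch over two of the sample lines. The @lines and @parsed arrays are additions for the example; split /,/ with no second argument operates on $_, which the for loop sets.

```perl
use strict;
use warnings;

my @lines = (
    'Frank Doe - Toronto SASS,Ajax; Aurora; Woodbridge,4/23/2009,TASS,0',
    'Frank Doe - Toronto SASS,Ajax; Greenwood,4/23/2009,TASS,0',
);

my @parsed;   # one array ref of town names per input line
for (@lines) {
    my @comma_groups  = split /,/;                        # splits $_
    my @scolon_groups = split /;\s*/, $comma_groups[1];   # second field only
    push @parsed, \@scolon_groups;
    printf "%d towns: %s\n", scalar @scolon_groups, join(', ', @scolon_groups);
}
```

This assumes no field ever contains a literal comma; if that can happen, a CSV parser would be a safer choice than a bare split.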