Crackers2 has asked for the wisdom of the Perl Monks concerning the following question:

I have a string like this:

$_ = "This is list number 12. It contains apples, pears, peaches. Total cost is 5.";

and I want to extract the elements of the list (apples, pears, peaches). I would like to do this using a single regex.
What I came up with is this:

my (@rep) = /It contains (?:\s*([^,]*),)*\s*([^\.]*)\./;

But even though this does match the text, @rep contains just pears and peaches.
Am I missing something trivial, or should I just give up on the single-regex idea and use multiple regexes and/or split to get what I want?

The example above is made up, but the actual strings look structurally the same: a preamble, followed by a variable number of <text><separator> pairs, followed by a postamble.

Update: ikegami++ for the most practical solution, davido++ for a great single-regex solution

Update: pijll++ for doing both practical and single-regex

Replies are listed 'Best First'.
Re: Regex to extract multiple occurrences
by ikegami (Patriarch) on Sep 11, 2004 at 04:24 UTC

    You're confusing the behaviour of m//g with m//.

    Without /g, a capture group (parens) returns only one value, the last one it matched, even if the expression inside it matches more than once (as is the case here). So while the first capture matches both 'apples' and 'pears', only 'pears' is returned for it. What follows is a solution where /g is used to return multiple values for one capture:

    my (@rep) = /It contains ([^.]*)\./ && $1 =~ /\G(\S[^,]*)(?:,\s*)?/g;

    But you might as well use:

    my (@rep) = /It contains ([^.]*)\./ && split(/\s*,\s*/, $1);
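
    To make the difference concrete, here is a minimal sketch that runs the original single-match regex next to the two-step version on the sample string. The variable names ($text, @single, @two_step) are only illustrative.

```perl
use strict;
use warnings;

my $text = "This is list number 12. It contains apples, pears, peaches. Total cost is 5.";

# One match, no /g: the repeated group (...)* keeps only its final iteration.
my @single = $text =~ /It contains (?:\s*([^,]*),)*\s*([^\.]*)\./;

# Two steps: isolate the list, then split it on the separator.
my @two_step = $text =~ /It contains ([^.]*)\./ ? split(/\s*,\s*/, $1) : ();

print "single match: @single\n";    # pears peaches
print "two steps:    @two_step\n";  # apples pears peaches
```

    The first pattern does match 'apples', but that value is overwritten when the group repeats, which is why it never reaches the result list.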
Re: Regex to extract multiple occurrences
by davido (Cardinal) on Sep 11, 2004 at 04:54 UTC

    ikegami provided a very good answer for why your original regexp didn't return the captures you were hoping for, and he provided a good m//g solution. But I wanted to point out yet another way to do it, using the experimental (?{...code...}) construct, and as you requested, a single regular expression. ;)

    use strict;
    use warnings;
    use vars qw/@array/;

    $_ = "This is list number 12. It contains apples, pears, peaches. Total cost is 5.";

    if ( m/contains\s+
           (?:(\w+)(?{push @array, $^N}),\s+)*
           (?:(\w+)(?{push @array, $^N})\.\s+\b)
         /x ) {
        print "@array\n";
    }

    I used the /x modifier to break the regexp into smaller chunks so you could see more clearly what's going on. The special variable $^N contains the most recent capture within the regexp. So even though $1 only contains the last match in the first set of parens, if you check at the proper point in time, it (and $^N) will contain that first capture too. Be sure to read up in perlre for details. This is a tricky subject, but kind of fun, IMHO.


    Dave

Re: Regex to extract multiple occurrences
by pijll (Beadle) on Sep 11, 2004 at 09:16 UTC
    Another solution, 1 regexp, and a bit less complicated than that of davido:
    my (@rep) = /(?:It contains|\G) (\w+)[.,]/g;
    This uses the \G anchor to continue where the previous match ended.
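
    A small sketch of how \G behaves here, assuming the same sample string: in scalar context each /g match resumes at pos(), so the loop below shows where every match ends and where the next one starts. The names $text and @items are only illustrative.

```perl
use strict;
use warnings;

my $text = "This is list number 12. It contains apples, pears, peaches. Total cost is 5.";

my @items;
# Each successful /g match sets pos($text); \G anchors the next attempt there.
while ( $text =~ /(?:It contains|\G) (\w+)[.,]/g ) {
    push @items, $1;
    printf "%-8s match ends at offset %d\n", $1, pos($text);
}
print "items: @items\n";
```

    After 'peaches.' the \G branch fails because ' Total' is not followed by a dot or comma, and 'It contains' does not occur again, so the loop stops.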
Re: Regex to extract multiple occurrences
by dragonchild (Archbishop) on Sep 11, 2004 at 03:43 UTC
    Personally, I would use two steps
    1. Extract the text you want using a regex
    2. Split the text using split

    But, to solve your problem, try

    my (@rep) = /It contains (?:\s*([^,]+)+,)\s*([^\.]*)\./;

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      That regexp doesn't work. It returns 'apples', 'pears, peaches'. See my post for the reason both your solution and the OP's solution don't work. ++ for suggesting split, however.

      Looks like two steps is the easiest way to go indeed.

      Your regex bunches up all of the occurrences except for the first one into a single capture, so I'd still have to split it and might as well make the capture include the first element as well.
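
      For the general preamble / separator / postamble shape, the two-step approach could be wrapped up roughly like this. extract_items is a hypothetical helper, not something from the thread, and it assumes the preamble, separator and postamble are literal strings rather than patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the items found between $preamble and the
# first following $postamble, split on $separator. Returns an empty list
# when the text does not match.
sub extract_items {
    my ($text, $preamble, $separator, $postamble) = @_;
    my ($list) = $text =~ /\Q$preamble\E\s*(.*?)\s*\Q$postamble\E/s
        or return;
    return split /\s*\Q$separator\E\s*/, $list;
}

my @fruit = extract_items(
    "This is list number 12. It contains apples, pears, peaches. Total cost is 5.",
    "It contains", ",", ".",
);
print join("|", @fruit), "\n";    # apples|pears|peaches
```

      The \Q...\E quoting keeps characters such as '.' literal, so the same helper works whatever the separator happens to be.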

Re: Regex to extract multiple occurrences
by dr3ad (Initiate) on May 30, 2022 at 05:27 UTC
    my (@rep) = /(?:It contains\s+|\G)(?:([^,.\s]+)[,.]\s)/g;
    perl -d
    DB<1> $_ = "This is list number 12. It contains apples, pears, peaches. Total cost is 5.";
    DB<2> x /(?:It contains\s+|\G)(?:([^,.\s]+)[,.]\s)/g
    0  'apples'
    1  'pears'
    2  'peaches'

    Starting after 'It contains ', this captures anything that isn't a dot, comma, or space that is followed by a dot or comma followed by space. Requiring that a space follows the matched dot or comma eliminates '5.' from the results.