whatever has asked for the wisdom of the Perl Monks concerning the following question:

Is there NO way to capture all the matches to a numerically quantified subexpression? e.g.:
my $ham = "spam\tspam\tspam\t\tyam\tclam";
my @jam = ($ham =~ (m/^[^\t]*\t[^\t]*(?:\t([^\t]*)){3}\t[^\t]*$/));
print join("\n", '**', @jam, '**', '');
What I get is this:
** yam **
because each subsequent match to the group stomps on the previous one. What I want is this:
** spam yam **
I have searched and searched and searched for a simple answer to this, and I have come up empty. Note that, for my purposes, the solution of pulling the inner set of three out as a string and splitting that string on tabs is NOT satisfactory: I want/need to grab all the desired fields in one go. This should be trivial, yet it seems to be impossible.

Re: How to capture quantified repeats?
by BrowserUk (Patriarch) on Sep 22, 2010 at 19:04 UTC
    This should be trivial

    T'is trivial if you use the right tool:

    $ham = "spam\tspam\tspam\t\tyam\tclam";
    @jam = split /\t+/, $ham;
    print "@jam";

    spam spam spam yam clam

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: How to capture quantified repeats?
by kennethk (Abbot) on Sep 22, 2010 at 20:22 UTC
    After reading other responses and your replies, I offer two solutions:
    1. Text::CSV and its ilk - You are dealing with large delimited data files, so why not use a module designed to handle those?
    2. If you have a line in memory and you know you want to get the 1st, 4th and 5th terms, why not just grab those terms?

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $ham = "spam\tspam\tspam\t\tyam\tclam";
      my @jam;
      for my $i (0, 3, 4) {
          push @jam, $ham =~ /(?:[^\t]*\t){$i}([^\t]*)/;
      }
      print join("\n", '**', @jam, '**', '');

      You could even code that into a single expression if you only wanted to run it once.
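      One way that single expression might look, as a sketch rather than the poster's own code: map runs the same pattern once per zero-based field index and collects the captures. The leading ^ anchor is an addition here, to make explicit that counting starts at the beginning of the line.

```perl
use strict;
use warnings;

my $ham = "spam\tspam\tspam\t\tyam\tclam";

# In list context each match returns its single capture, so map
# gathers one field per requested index (0, 3 and 4).
my @jam = map { $ham =~ /^(?:[^\t]*\t){$_}([^\t]*)/ } 0, 3, 4;

print join("\n", '**', @jam, '**', '');
```

      If a line has fewer fields than an index asks for, that match fails and contributes nothing to the list, so the result can be shorter than the index list.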

Re: How to capture quantified repeats?
by JavaFan (Canon) on Sep 22, 2010 at 19:48 UTC
    Is there NO way to capture all the matches to a numerically quantified subexpression?
    Indeed, there isn't. The number of capture groups is set by the number of capturing parens in the regular expression.
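    Since the group count follows the parentheses in the pattern, one possible workaround is to build the pattern so that it really contains one pair of capturing parentheses per repeat. The sketch below does that with the string repetition operator x; the variable names $count, $fields and $re are illustrative and not from the original post.

```perl
use strict;
use warnings;

my $ham   = "spam\tspam\tspam\t\tyam\tclam";
my $count = 3;    # hypothetical: how many consecutive fields to capture

# Repeat the capturing sub-pattern as text, so the compiled regex
# contains $count separate pairs of capturing parentheses.
my $fields = '\t([^\t]*)' x $count;
my $re     = qr/^[^\t]*\t[^\t]*$fields\t[^\t]*$/;

# List-context match returns one value per capture group.
my @jam = $ham =~ $re;

print join("\n", '**', @jam, '**', '');
```

    This still performs a single match against the line, and the empty fourth field keeps its place in the result.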
    This should be trivial, yet it seems to be impossible.
    I'm glad someone thinks the code dealing with regular expressions in Perl is trivial. AFAIK, you're unique. I'd think that patches would be more than welcome.

    Actually, since 5.10, the number of capturing parens is an upper bound, due to the (?|) construct.
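    A small illustration of that point, with made-up sample inputs: the pattern below has four pairs of capturing parentheses, but the branch reset group restarts numbering in each alternative, so a match only ever yields two captures.

```perl
use strict;
use warnings;

# Four pairs of capturing parentheses, but (?|...) numbers the groups
# in each alternative from 1 again, so only $1 and $2 exist.
my $re = qr/^(?|(\d+)-(\d+)|(\w+)=(\w+))$/;

for my $input ('10-20', 'key=value') {    # hypothetical sample inputs
    my @caps = $input =~ $re;
    print scalar(@caps), " captures: @caps\n";
}
```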

Re: How to capture quantified repeats?
by TomDLux (Vicar) on Sep 22, 2010 at 20:23 UTC

    How about using variables to turn confusion to clarity? Better yet, go to the library and borrow Perl Best Practices to see why '\A' and '\z' are better than '^' and '$'. Spreading out your regex and adding comments helps, too.

    my $ham     = "spam\tspam\tspam\tyam\tclam";
    my $word    = qr{[^\t]+};
    my $sep     = qr{\t};
    my $capture = qr{($word)};

    my (@jam) = ($ham =~ m{\A              # enforce beginning of string
                           $word $sep      # skip first word and separator
                           $word $sep      # and the second
                           $capture $sep   # capture next two words
                           $capture $sep   # skip the separators
                           $word           # skip a word
                           \z              # and then it's the end of the string
                         }xms);
    print join("\n", '**', @jam, '**', '');

    Using the debugger helps, too. Along the way I noticed your string has a double tab '\t\t', but you only ever accept single tabs; you specify end of string '$' when there's still another word to go.

    But you're doing too much work. You could split() on '\t' and select only the components you want. If it's the 3rd & 4th ...

    my @jam = ( split "\t", $ham )[2,3];

    If you do need to use a regex, do you need to check whether there is a word after your capture? Do you need to enforce there is nothing after that last word? Simplify your regex for greater happiness.

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      Something like:

      my $ham = "spam\tspam\tspam\t\tyam\tclam";
      my @jam;
      $ham =~ s/^[^\t]*\t[^\t]*((?:\t[^\t]*){3})\t[^\t]*$/push @jam,(split "\t",$1);$1/eg;
      print join("\n", '**', @jam, '**', '');

      maybe?

Re: How to capture quantified repeats?
by moritz (Cardinal) on Sep 23, 2010 at 07:09 UTC
    Is there NO way to capture all the matches to a numerically quantified subexpression?

    Of course there is:

    use v6;
    if "abc" ~~ /(.)+/ {
        say $0.join(", ");
    }

    In Perl 6, quantifying a capturing group or atom just results in an array of Match objects.

    Perl 6 - links to (nearly) everything that is Perl 6.
      Thanks. Not sure whether Perl 6 is an option, but I'll take a look.
Re: How to capture quantified repeats?
by umasuresh (Hermit) on Sep 22, 2010 at 19:28 UTC
    You may have better luck changing the code to:
    my $ham = "spam\tspam\tspam\t\tyam\tclam";
    my @jam = ($ham =~ (m/^[^\t]+\t[^\t]+\t([^\t]+)(\t\t)([^\t]+)\t[^\t]+$/));
    print join("\n", '**', @jam, '**', '');
    prints:
    ** spam yam **
Re: How to capture quantified repeats?
by umasuresh (Hermit) on Sep 22, 2010 at 19:47 UTC
    If you know which fields you want to extract say for e.g. columns 2,4 in a really large file, you can try
    my ($col1, $col2) = (split(/\t/, $ham))[2,4] ;
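    Applied to the sample line from the question, the same slice idea might look like the sketch below. It splits on single tabs so the empty field keeps its position; the -1 limit is an addition here, since split otherwise drops trailing empty fields.

```perl
use strict;
use warnings;

my $ham = "spam\tspam\tspam\t\tyam\tclam";

# Split on every single tab and slice out the third to fifth fields
# (zero-based indices 2 to 4); -1 keeps trailing empty fields.
my @jam = (split /\t/, $ham, -1)[2 .. 4];

print join("\n", '**', @jam, '**', '');
```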
Re: How to capture quantified repeats?
by james2vegas (Chaplain) on Sep 23, 2010 at 15:18 UTC
    Sure, you can, but you'd probably need to use an extra variable, or two:

    use strict;
    use warnings;

    my $ham = "spam\tspam\tspam\t\tyam\tclam";
    my @foo;
    my @jam = ($ham =~ (m/^[^\t]*\t[^\t]*(?:\t([^\t]*)(?{push @foo, $^N})){3}\t[^\t]*$/));
    print join("\n", '**', @jam, '**', '');
    print join("\n", '**', @foo, '**', '');

    You may need to do some variation of the local variable dance as shown in perlretut if backtracking is a concern.
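    A rough sketch of that local-variable pattern, assuming a package array named @stack (a hypothetical name): each repeat extends a localized copy, which the regex engine undoes if it backtracks past that point, and the result is copied into a lexical array only once the whole pattern has matched.

```perl
use strict;
use warnings;

my $ham = "spam\tspam\tspam\t\tyam\tclam";

our @stack;    # package variable, because local cannot be applied to a lexical
my @foo;

$ham =~ m/
    ^ [^\t]* \t [^\t]*
    (?: \t ([^\t]*)
        (?{ local @stack = (@stack, $^N) })   # undone on backtracking
    ){3}
    \t [^\t]* $
    (?{ @foo = @stack })                      # runs only after everything matched
/x;

print join("\n", '**', @foo, '**', '');
```

    The final copy matters because the localized values are restored once the match finishes, so @stack itself is empty again afterwards.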
      There are a couple of issues with that code, but the OP made it clear he only wants a yes/no answer. I was going to post something of the kind, but it's not nearly as good as BrowserUk's solutions.