in reply to Question on Regex grouping

There are many ways to go about this.

First, do not put parens '()' around things that you have no interest in using later. "Capturing" these things consumes time and resources and to no effect.
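As a minimal sketch of that advice: when parentheses are needed only for grouping (an alternation, say), `(?:...)` groups without capturing, so the returned list holds only the values you actually want. The helper names and the `abc|xyz` alternation are made up for illustration and are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: the alternation is wrapped in CAPTURING parens,
# so the match returns two values even though only the digits matter.
sub with_capturing_group {
    my ($line) = @_;
    my @got = $line =~ /(abc|xyz)\d{5}.*def(\d{8})/;
    return @got;
}

# Hypothetical helper: (?:...) groups the alternation without capturing,
# so the 8 digits are the only element of the returned list.
sub with_noncapturing_group {
    my ($line) = @_;
    my @got = $line =~ /(?:abc|xyz)\d{5}.*def(\d{8})/;
    return @got;
}

my $line = 'qwerabc12345 def12345678';
print join( ',', with_capturing_group($line) ),    "\n";   # abc,12345678
print join( ',', with_noncapturing_group($line) ), "\n";   # 12345678
```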

In general, I avoid using $1, $5 etc. Use Perl list slice instead. Assign directly to a variable like the code below shows. As you write more and more Perl code this $1, $2 stuff will appear less and less often.

This term m/abc\d{5}/ is a pre-condition - I have no problem at all with writing code that says: "forget this line if that pre-condition is not satisfied" and that is what the code below does.

Trying to compress things into a single statement gives Perl a bad name as a "write only" language and that reputation is undeserved! I am a big fan of both C and of Perl. It is easy to write obscure stuff in both languages, but you don't have to!

#!/usr/bin/perl -w
use strict;

my @x = (
    'smomedef12345',
    'anabc12345 and there is some def12345678',
    'qwerabc12345def55',
    'def87654321abc54321',
);

foreach (@x) {
    next unless (m/abc\d{5}/);          #pre-condition to look further
    (my $string8) = m/def(\d{8})/;      #puts $string8 in list context
    #$string8 = (m/def(\d{8})/)[0];     #alternate way with list slice
    if ( !defined($string8) ) {
        $string8 = 'undefined';   #Perl 5.10 has a special way to do this
        #Probably here just do "next;"
        # because an undefined value means the
        # regex above did not match!
    }
    print "var def=$string8\n";
}
__END__
prints: ..note that first item is silently skipped!
var def=12345678
var def=undefined
var def=87654321   #note that this works even though
                   #the pre-condition of abc\d{5}
                   #occurs later in the line! Wow!
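The "special way" mentioned in the code comment presumably refers to the defined-or operator (`//` and `//=`) introduced in Perl 5.10. A small sketch, assuming that is what was meant; the helper name is hypothetical:

```perl
use strict;
use warnings;
use 5.010;    # the // and //= operators need Perl 5.10 or later

# Hypothetical helper wrapping the def(\d{8}) match from the code above.
sub def_digits_or_default {
    my ($line) = @_;
    my ($string8) = $line =~ m/def(\d{8})/;
    $string8 //= 'undefined';    # assigns only when $string8 is undef
    return $string8;
}

say 'var def=', def_digits_or_default($_)
    for 'anabc12345 and there is some def12345678', 'qwerabc12345def55';
```

Unlike `||=`, the `//=` form leaves defined-but-false captures such as `0` or the empty string alone.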
Update: OK, a more obscure solution:
my @x = (
    'smomedef12345',
    'anabc12345 and there is some def12345678',
    'qwerabc12345def55',
    'def87654321abc54321',
);

print map{ /abc\d{5}/ and /def(\d{8})/ ? "def=$1\n" : () }@x;
__END__
prints:
def=12345678
def=87654321
Does essentially the same thing but in a much more obscure way.
I think the first code is better for a lot of reasons.

Re^2: Question on Regex grouping
by JavaFan (Canon) on Dec 21, 2010 at 11:33 UTC
    First, do not put parens '()' around things that you have no interest in using later. "Capturing" these things consumes time and resources and to no effect.
    Most of the cost is paid by the first parenthesis, that is, there's a significant cost difference between not using capturing parens at all, and using capturing parens. Additional parens don't contribute that much.
    In general, I avoid using $1, $5 etc. Use Perl list slice instead.
    Careful here. Using a list slice (which I find quite ugly), or assigning the match to a list, puts the match in list context, which will change the behaviour if /g is present.
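A short sketch of the context difference, using a made-up string and pattern (not from the thread): with /g, scalar context steps through the matches one at a time, while list context returns every capture at once, so a slice indexes across matches, not across capture groups.

```perl
use strict;
use warnings;

my $text = 'a1b2c3';    # made-up sample data

# Scalar context: each evaluation of m//g finds the NEXT match and
# leaves its capture in $1, so a while loop walks the string.
my @seen;
while ( $text =~ /([a-z])\d/g ) {
    push @seen, $1;
}

# List context: a single evaluation returns the captures of all matches.
my @all = $text =~ /([a-z])\d/g;

# A slice of that list picks among ALL matches, so [1] is the capture
# from the second match, not a second capture group.
my $second = ( $text =~ /([a-z])\d/g )[1];

print "@seen | @all | $second\n";    # a b c | a b c | b
```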

    But more importantly, in certain cases, when using list slices, you do not know whether there was a match or not:

    my $a = rand() < .5 ? "f" : "g";
    my $b = rand() < .5 ? "p" : "o";
    my $c = ("foo" =~ /($a)*$b/)[0];
    Did it match, or didn't it? If $c is defined, it matched. But what if $c isn't? If $a eq "g", and $b eq "o", there is a match, but $c is undefined.
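One way to tell the two cases apart, sketched here with a hypothetical helper, is to test the list assignment itself: evaluated in scalar context it yields the number of values the match returned, which is zero only when the match failed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (matched?, first capture).
sub match_and_capture {
    my ( $str, $re ) = @_;

    # A list assignment in scalar (here: boolean) context yields the
    # number of elements on its right-hand side: 0 when the match fails,
    # at least 1 when it succeeds, even if the captured value is undef.
    my $matched = ( my ($c) = $str =~ $re ) ? 1 : 0;
    return ( $matched, $c );
}

for my $re ( qr/(g)*o/, qr/(f)*p/ ) {
    my ( $matched, $c ) = match_and_capture( 'foo', $re );
    printf "%s matched=%d capture=%s\n", $re, $matched,
        defined $c ? $c : 'undef';
}
```

With `qr/(g)*o/` the match succeeds while the capture is undef; with `qr/(f)*p/` the match fails. A `[0]` slice alone yields undef in both cases.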
      "Additional parens don't contribute that much".

      Fair enough, there is overhead in doing it at all. I am saying "don't over do it".

      List slice, hash slice, etc. are some of the coolest features in Perl! You are completely correct that list slice does not "play well with match global" because the number of things that can be returned is variable and therefore there is no way to specify a subset or range of indices that are of interest.

      The classic example of list slice is splitting a line when you want fields 127, 3..5, 93 and 8 of that line. And I do work with DB lines like that - it is actually common for such a thing. List slice allows me to assign those 6 things directly into variables that mean something within the program. I usually assign vars on the left ($x,$y,$z..) in the order that the following code will use them, and adjust the slice accordingly.
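A sketch of that split-and-slice idiom. The record is generated on the spot and the variable names are invented for illustration; only the slice indices come from the post.

```perl
use strict;
use warnings;

# Made-up record: 130 comma-separated fields named f0 .. f129.
my $line = join ',', map { "f$_" } 0 .. 129;

# Hypothetical variable names, listed in the order later code would use
# them; the slice indices select the six fields in that same order.
my ( $id, $x, $y, $z, $status, $code ) =
    ( split /,/, $line )[ 127, 3 .. 5, 93, 8 ];

print "$id $x $y $z $status $code\n";    # f127 f3 f4 f5 f93 f8
```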

      If you are saying that "do not use list slice when doing a match global", I would absolutely agree with that. And I do not think that I have recommended that.

      In your code, my $c = ("foo" =~ /($a)*$b/)[0]; is an improper use of list slice.

      Properly used, list slice is beautiful.

        In your code, my $c = ("foo" =~ /($a)*$b/)[0]; is an improper use of list slice.
        Then enlighten us, what is the "proper use" of list slices to avoid using $1, etc? Note that the OP used the match in an if statement, so whether the pattern matched or not is important.
        If you are saying that "do not use list slice when doing a match global", I would absolutely agree with that. And I do not think that I have recommended that.
        Well, you wrote:
        In general, I avoid using $1, $5 etc. Use Perl list slice instead.
        If you are using words like in general, followed by an unqualified demand what to use, be prepared for others pointing out cases where the "in general" doesn't work.
        People who dislike list slicing should avoid scripting languages, especially Perl. It's FALSE that you don't know whether there was a match with m//g because the created list is simply empty. Also, the GOATSE ( =()=) recreates the right context if that's an issue.

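        Take this code:
        $x = "a123b345c7865d87";
        @L = ($x =~ /[a-z]/g)[1,3];
        print "@L";   ## Prints b d
        @X = ($x =~ /#/g)[1,3];
        # Note: defined(@X) is deprecated and a fatal error from Perl 5.22 on;
        # on newer perls test the array itself: @X ? "YES" : "NO"
        print(defined(@X) ? "YES" : "NO");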
        It prints NO ... therefore, JavaFan, your assertions are FALSE and FALSE. TenThouPerlStudents

        Sorry about the format but ... why does one need to use HTML tags to format one's own post???? It's my first post. I guess I'm used to sites more intelligently designed that format as written in the window.

        Nonetheless, the points I made are compelling. JavaFan's assertions are absolutely false. List slicing of //g, if it creates no list, makes the list variable undefined.

        I'd also like to add that Perl nitpickers like to get all hot and bothered about lists vs. arrays yet the goatse is the one stop shop that enables //g to be added if one is COUNTING matches.
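For reference, a minimal sketch of the counting idiom being referred to, using the sample string from this post. The empty list assignment `() =` puts the match in list context, and the outer scalar assignment then receives the number of elements that list produced.

```perl
use strict;
use warnings;

my $x = 'a123b345c7865d87';

# "= () =" forces list context on the //g match; the scalar on the
# far left receives the count of matches, not the matches themselves.
my $letters = () = $x =~ /[a-z]/g;
my $hashes  = () = $x =~ /#/g;

print "letters=$letters hashes=$hashes\n";    # letters=4 hashes=0
```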

        I've largely avoided this site because as a lurker I've noticed that the "priests" and above are more interested in obscurantism and showing off than in helping newbies correctly and getting jobs done simply. TMTOWTDI is VASTLY abused here.

Re^2: Question on Regex grouping
by ajguitarmaniac (Sexton) on Dec 21, 2010 at 09:09 UTC

    Thanks Marshall!