BenjiSmith has asked for the wisdom of the Perl Monks concerning the following question:

Another tricky (at least in my book) regex question, that unfortunately looks deceptively simple. Here's some code...

$test = "abc";
while ($test =~ m/(a(b?c)?)/g) {
    print "$1\n";
}

Ultimately, I need the regex to produce the following output:

a
ab
abc

Using greedy quantifiers, I'm currently getting only the single match "abc", and if I switch to using reluctant quantifiers, I only get the single match "a".

Of course, I understand why that's the case. The matcher stops as soon as it has found a match (either greedily or reluctantly) and only resumes searching after the end of that match, so it never reports other matches that overlap it.
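
A minimal sketch of that behaviour, using the regex from the snippet above (the inner group is made non-capturing here, purely so that list-context /g returns only $1):

```perl
use strict;
use warnings;

my $test = "abc";

# In list context, /g returns every match found in one left-to-right pass.
# Each new search starts where the previous match ended.
my @greedy    = $test =~ /(a(?:b?c)?)/g;     # greedy quantifiers
my @reluctant = $test =~ /(a(?:b??c)??)/g;   # reluctant quantifiers

print "greedy:    @greedy\n";      # a single match
print "reluctant: @reluctant\n";   # also a single match
```

Either way there is exactly one match per pass, because the next search begins after the text already consumed.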

But for my application, I really need to find all instances of all matches, even if they occur within other matches.

Now, before you answer the question, I know what a lot of you are going to say: "use $2 and $3".

Unfortunately, that's not going to work. The regular expression isn't hard-coded into the application. Instead, it's dynamically generated from a list of (possibly thousands of) search terms. In fact, my real code uses non-capturing groups in the interior of the expression, so the only captured group that I can access is $1.

Any ideas?

--BenjiSmith

Replies are listed 'Best First'.
Re: Regex Subexpressions
by liverpole (Monsignor) on Sep 09, 2005 at 23:03 UTC
    I'm not clear on exactly what you're trying to do.  (Especially when you change the original post, and then change it back -- don't do that!)

    Is it your intent to match the strings "a", "ab" and "abc" -all- from just the input text "abc"?  I don't know if that's possible...

    If you could maybe supply examples (even just short ones, such as a dozen of the search terms from which the list will be dynamically generated), with a little more verbosity on what results you're hoping for and *why*, maybe some of us can more expeditiously point you towards a viable solution.

      Yeah, sorry for changing the post around. I thought I was replying to myself when in fact I was editing the original post.

      Anyhow, here's a more concrete example:
      @keywordList = ('john', 'john.smith', 'john.smith@mail.com');
      $combinedExpression = combine(@keywordList);

      # The combined expression looks something like this:
      # (john(?:\.smith(?:\@mail\.com)?)?)

      $searchText = "john's username is john.smith and his email address is john.smith\@mail.com";
      while ($searchText =~ /$combinedExpression/g) {
          print "$1\n";
      }

      For this example, I expect to get these results:

      john
      john
      john.smith
      john
      john.smith
      john.smith@mail.com

      Essentially, for every occurrence of every one of my keywords, I need to get a result, even if those keywords occur within other keywords in the input text.

        For one possible implementation of your combine subroutine, consider:

        use Regexp::Assemble;

        sub combine {
            my $str = Regexp::Assemble->new->add(@_)->as_string;
            qr/($str)/;
        }

        - another intruder with the mooring in the heart of the Perl

      Great looking solutions you guys, but I actually have a few additional design constraints:

      1. No Perl code embedded in the regex. After the prototype has been demonstrated in Perl, it will be implemented in the product using Java, so it must be compatible with the Java regex engine.

      2. The regex should only have to be compiled once, so no rewriting of the regex string after starting to iterate through the matches.
        Whoops -- there's your last reply.  I didn't see it when I was answering your previous one.

        I'm going to have to throw in the towel -- what you're asking for is out of my league!  Isn't there some way you could do the equivalent thing in Java code though?   Anyway, good luck!

Re: Regex Subexpressions
by chester (Hermit) on Sep 09, 2005 at 23:14 UTC
    To start, I assume the regex you meant was

    /(a(bc?)?)/g

    because yours as posted doesn't match 'ab', though it does match 'ac'. Forgive me if I'm misinterpreting. I came up with what's below, but I'm not at all sure it works in the general case. You can play with it. It gives your desired output, using my regex. It saves a possible match, then fails the regex on purpose to start backtracking to get all the possible matches.

    use strict;
    use warnings;

    my $test = 'abc';
    my @matches;
    $test =~ /(a(bc?)?)(??{push @matches, $^N})(?!)/;
    print "@matches\n";

    If you can generate the regexes, it would probably be easier just to generate separate regexes.
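
    The save-then-fail trick can also be tried on the keyword example from earlier in the thread. This is only a sketch under a few assumptions: the combined expression is written out by hand rather than generated, it uses (?{ ... }) instead of (??{ ... }) since the return value is not needed, and the quantifiers are made reluctant so the shortest keyword is recorded first. It relies on Perl-only embedded code, so it would not carry over to an engine without code blocks.

```perl
use strict;
use warnings;

my $searchText = "john's username is john.smith and his email address is john.smith\@mail.com";

my @found;
# Perl-only: (?{ ... }) records the text of the group that just closed ($^N),
# then (?!) always fails, which forces the engine to backtrack into the next
# possibility and, once those run out, to move on to the next start position.
$searchText =~ /(john(?:\.smith(?:\@mail\.com)??)??)(?{ push @found, $^N })(?!)/;

print "$_\n" for @found;
```

    The match as a whole always fails; the list of recorded strings is the only result.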
      To produce the output which BenjiSmith is looking for, I would make one modification, s/push/unshift/:
      #!/usr/bin/perl -w
      use strict;
      use warnings;

      my $test = 'abc';
      my @matches;
      $test =~ /(a(bc?)?)(??{unshift @matches, $^N})(?!)/;
      print "@matches\n";

      Here's another "solution", which may not be at all correct (it matches an input string of "aaa", for example), but does it shed any more light on what the solution should be?:
      #!/usr/bin/perl -w
      use strict;
      use warnings;

      my $test  = "abc";
      my $count = 1;
      while ($test =~ m/([abc]{$count})/) {
          my $match = $1;
          ++$count;
          printf "%s\n", $match;
      }

Re: Regex Subexpressions
by liverpole (Monsignor) on Sep 09, 2005 at 23:56 UTC
    Okay, how about this for a solution?  It doesn't do it all in a single regex (hey, go easy on me; I'm only at monk level), but it does do what I think it is you're asking for ...
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    # Prototypes
    sub findMatches($$);

    # Input data
    my @keywordList = (
        'john',
        'john.smith',
        'john.smith@mail.com'
    );

    # Main program
    my $searchText = "john's username is john.smith and his email address is john.smith\@mail.com";
    my $pmatches = findMatches($searchText, \@keywordList);
    map { print "$_\n"; } @$pmatches;

    # Subroutines
    #
    # Inputs:  $1 ... the text string to match against
    #          $2 ... a pointer to the list of valid matching substrings
    #
    # Outputs: $1 ... a pointer to a list of all matches
    #
    sub findMatches($$) {
        my ($text, $plist) = @_;
        my @matches;
        foreach my $pattern (@$plist) {
            while ($text =~ /($pattern)/g) {
                my $result = $1;          # Got a match in $1
                push @matches, $result;   # Save it to our master results list

                # Now trim off the first character (otherwise we'll be matching
                # against the same substring; think 'deep recursion'), and call
                # this subroutine again for recursively generated sub-matches.
                # Whatever we get (if anything) is added to the list.
                #
                $result =~ s/^.//;
                my $psubstr = findMatches($result, $plist);
                push @matches, @$psubstr;
            }
        }
        return \@matches;
    }

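    One thing to watch when keywords are interpolated into a pattern as in the loop above: a keyword such as 'john.smith' contains a regex metacharacter. A minimal sketch of the difference, assuming the keywords are meant to be literal strings (the sample text 'johnXsmith' is made up for illustration):

```perl
use strict;
use warnings;

my $keyword = 'john.smith';
my $text    = 'johnXsmith';   # hypothetical input, not from the thread

# Unescaped, the "." in the keyword is a wildcard and matches the "X".
my $loose   = ($text =~ /($keyword)/)     ? 1 : 0;

# \Q...\E (quotemeta) escapes metacharacters, so the keyword matches literally.
my $literal = ($text =~ /(\Q$keyword\E)/) ? 1 : 0;

print "unescaped matched: $loose, escaped matched: $literal\n";
```
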
    By the way, you're a Java programmer, right?  I'm not being pejorative or anything (well, maybe a little :-), but I had never heard of a "reluctant quantifier", and when I googled for it, I got mostly Java matches.  That plus the lowerCaseMixedWithUpperCase variable names kinda gave you away ... ;->
      Yep, I'm a Java programmer. I've been programming in Java for about five years, but I've also been writing Perl for the last three years. I just picked up Python a few weeks ago, and I really like it a lot.

      I've also done brief stints with VBA, D, C++, and x86 Assembly.

      But I know that--when I have a difficult regex problem--the people at perlmonks are more likely to be able to figure out a solution than anyone else.
        "But I know that--when I have a difficult regex problem--the people at perlmonks are more likely to be able to figure out a solution than anyone else."

        I consider that a misuse, perhaps even an abuse, of the openness here.

        That'd be like saying "I know that most people here probably have Apple computers, so I'll post my Apple questions here". Seeing questions about Java, with Java's restrictions (even if it is regex), is not the reason I come here. Your task was simple with Perl's regexes. Java's regexes are different, and thus the answers weren't relevant, but you didn't disclose that at the beginning.

        Please mark your questions "off topic" next time, so I can ignore them.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

Re: Regex Subexpressions
by Roy Johnson (Monsignor) on Sep 10, 2005 at 02:45 UTC
    Are you looking for:
    @matches = m/(((a)b)c)/ and print "@matches\n";
    ?

    Caution: Contents may have been coded under pressure.
      Wow!  Now that's a great answer!

      Let me just add the following, since it took me a few minutes to see exactly what you were doing, as well as making it conform to the original requirements for the output:

      $_ = "abc";

      # Regex match of /.../ uses $_ when =~ is unspecified
      # Nicely solves the problem stated, and demonstrates list assignment
      # to a regex capture.
      #
      @matches = m/(((a)b)c)/ and print join "\n", reverse @matches;

      I realized a "reverse" was necessary, and then immediately fell into the trap of putting it before the "join".  I like your answer a lot for its succinct approach!
        Does your solution have to find "ab" in "abd"? If so, the above won't work.
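
        A small sketch of the difference, assuming "abd" as the input: with nested mandatory captures the whole "abc" has to be present, whereas nested optional groups still match a shorter prefix (though a single match then reports only the longest prefix found, not every prefix).

```perl
use strict;
use warnings;

# Nested captures: every group is mandatory, so "abd" does not match at all
# and the list assignment receives an empty list.
my @nested = "abd" =~ /(((a)b)c)/;

# Nested optional groups: "a" alone is enough; "b" and "c" are optional.
my @optional = "abd" =~ /(a(?:bc?)?)/;

print "nested captures:  ", scalar(@nested), "\n";
print "optional version: @optional\n";
```
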
Re: Regex Subexpressions
by ikegami (Patriarch) on Sep 10, 2005 at 03:48 UTC
    Do you have to use just one regexp? The following will work in Java, and is only compiled once:
    my @regexps = (
        qr/a/,
        qr/ab/,
        qr/abc/,
    );

    my $text = 'aababc';
    my @matches;
    foreach ($text) {  # Safe $_ = $text;
        foreach my $re (@regexps) {
            push(@matches, /($re)/g);
        }
    }

    print("$_\n") foreach @matches;

    a (at pos 0)
    a (at pos 1)
    a (at pos 3)
    ab (at pos 1)
    ab (at pos 3)
    abc (at pos 3)

    Remove the "g" from "/($re)/g" to match each regexp only once.
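
    If the match positions are wanted as well, here is a minimal sketch of one way to record them, using the same three regexps and text. The @hits array is a name made up for this example. The start offset comes from Perl's @- array; in Java the rough counterpart would be Matcher.start(), though that port is untested here.

```perl
use strict;
use warnings;

my @regexps = ( qr/a/, qr/ab/, qr/abc/ );
my $text    = 'aababc';

my @hits;
foreach my $re (@regexps) {
    # Scalar-context /g walks through the matches one at a time, so the
    # offset of each match is available while it is the current match.
    while ($text =~ /($re)/g) {
        my $start = $-[1];                    # start offset of group 1
        push @hits, "$1 (at pos $start)";
    }
}

print "$_\n" for @hits;
```
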