Regex Subexpressions

BenjiSmith has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex Subexpressions by liverpole (Monsignor) on Sep 09, 2005 at 23:03 UTC
I'm not clear on exactly what you're trying to do. (Especially when you change the original post, and then change it back -- don't do that!) Is it your intent to match the strings "a", "ab" and "abc" -all- from just the input text "abc"? I don't know if that's possible... If you could maybe supply examples (even just short ones, such as a dozen of the search terms from which the list will be dynamically generated), with a little more verbosity on what results you're hoping for and why, maybe some of us can more expeditiously point you towards a viable solution.
Re^2: Regex Subexpressions by BenjiSmith (Novice) on Sep 09, 2005 at 23:29 UTC
Yeah, sorry for changing the post around. I thought I was replying to myself when in fact I was editing the original post. Anyhow, here's a more concrete example:

`@keywordList = ('john', 'john.smith', 'john.smith@mail.com');
$combinedExpression = combine(@keywordList);
# The combined expression looks something like this:
# (john(?:\.smith(?:\@mail\.com)?)?)
$searchText = "john's username is john.smith and his email address is john.smith\@mail.com";
while ($searchText =~ /$combinedExpression/g) {
    print "$1\n";
}`

For this example, I expect to get these results:

john
john
john.smith
john
john.smith
john.smith@mail.com

Essentially, for every occurrence of every one of my keywords, I need to get a result, even if those keywords occur within other keywords in the input text.
Re^3: Regex Subexpressions by grinder (Bishop) on Sep 10, 2005 at 08:35 UTC
For one possible implementation of your `combine` subroutine, consider:

`use Regexp::Assemble;
sub combine {
    my $str = Regexp::Assemble->new->add(@_)->as_string;
    qr/($str)/;
}`

- another intruder with the mooring in the heart of the Perl
Re^2: Regex Subexpressions by BenjiSmith (Novice) on Sep 09, 2005 at 23:47 UTC
Great looking solutions, you guys, but I actually have a few additional design constraints:

1. No Perl code embedded in the regex. After demonstration of the prototype in Perl, it will be implemented in the product using Java, so it must be compatible with the Java regex engine.
2. The regex should only have to be compiled once, so no rewriting of the regex string after starting to iterate through the matches.
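One shape that might satisfy both constraints is sketched below. It is not taken from any reply in this thread. Each keyword is wrapped in its own lookahead with an empty alternative, `(?=(keyword)|)`, so the combined pattern matches a zero-width spot at every position and each capture group records whether its keyword starts there. The helper names `build_overlap_regex` and `find_all_keywords` are made up for illustration. The pattern uses only lookahead, alternation and capture groups, with no embedded code, so it should be portable to the Java engine, but treat that as an assumption until it has been tried there.

```perl
use strict;
use warnings;

# Hypothetical helper: one lookahead-wrapped capture group per keyword.
# The empty alternative lets the lookahead succeed even when the keyword
# does not start at the current position.
sub build_overlap_regex {
    my @keywords = @_;
    my $pattern = join '', map { '(?=(' . quotemeta($_) . ')|)' } @keywords;
    return qr/$pattern/;    # compiled once
}

# Hypothetical helper: walk every position and collect each group that matched.
sub find_all_keywords {
    my ($text, @keywords) = @_;
    my $re = build_overlap_regex(@keywords);
    my @found;
    while ($text =~ /$re/g) {
        for my $i (1 .. scalar @keywords) {
            my $start = $-[$i];
            next unless defined $start;    # this keyword does not start here
            push @found, substr($text, $start, $+[$i] - $start);
        }
    }
    return @found;
}

my $text = "john's username is john.smith and his email address is john.smith\@mail.com";
print "$_\n" for find_all_keywords($text, 'john', 'john.smith', 'john.smith@mail.com');
```

Because the overall match is always zero-width, the regex engine advances one character per iteration, which is what allows keywords nested inside other keywords to be reported separately.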
Re^3: Regex Subexpressions by liverpole (Monsignor) on Sep 10, 2005 at 00:09 UTC
Whoops -- there's your latest reply. I didn't see it when I was answering your previous one. I'm going to have to throw in the towel -- what you're asking for is out of my league! Isn't there some way you could do the equivalent thing in Java code though? Anyway, good luck!
Re: Regex Subexpressions by chester (Hermit) on Sep 09, 2005 at 23:14 UTC
To start, I assume the regex you meant was `/(a(bc?)?)/g` because yours as posted doesn't match 'ab', though it does match 'ac'. Forgive me if I'm misinterpreting. I came up with what's below, but I'm not at all sure it works in the general case. You can play with it. It gives your desired output, using my regex. It saves a possible match, then fails the regex on purpose to start backtracking to get all the possible matches.

`use strict;
use warnings;
my $test = 'abc';
my @matches;
$test =~ /(a(bc?)?)(??{push @matches, $^N})(?!)/;
print "@matches\n";`

If you can generate the regexes, it would probably be easier just to generate separate regexes.
Re^2: Regex Subexpressions by liverpole (Monsignor) on Sep 09, 2005 at 23:24 UTC
To produce the output which BenjiSmith is looking for, I would make one modification, s/push/unshift/:

`#!/usr/bin/perl -w
use strict;
use warnings;
my $test = 'abc';
my @matches;
$test =~ /(a(bc?)?)(??{unshift @matches, $^N})(?!)/;
print "@matches\n";`

Here's another "solution", which may not be at all correct (it matches an input string of "aaa", for example), but does it shed any more light on what the solution should be?

`#!/usr/bin/perl -w
use strict;
use warnings;
my $test = "abc";
my $count = 1;
while ($test =~ m/([abc]{$count})/) {
    my $match = $1;
    ++$count;
    printf "%s\n", $match;
}`
Re: Regex Subexpressions by liverpole (Monsignor) on Sep 09, 2005 at 23:56 UTC
Okay, how about this for a solution? It doesn't do it all in a single regex (hey, go easy on me; I'm only at monk level), but it does do what I think it is you're asking for ...

`#!/usr/bin/perl -w
use strict;
use warnings;

# Prototypes
sub findMatches($$);

# Input data
my @keywordList = (
    'john',
    'john.smith',
    'john.smith@mail.com'
);

# Main program
my $searchText = "john's username is john.smith and his email address is john.smith\@mail.com";
my $pmatches = findMatches($searchText, \@keywordList);
map { print "$_\n"; } @$pmatches;

# Subroutines
#
# Inputs:  $1 ... the text string to match against
#          $2 ... a pointer to the list of valid matching substrings
#
# Outputs: $1 ... a pointer to a list of all matches
#
sub findMatches($$) {
    my ($text, $plist) = @_;
    my @matches;
    foreach my $pattern (@$plist) {
        while ($text =~ /($pattern)/g) {
            my $result = $1;           # Got a match in $1
            push @matches, $result;    # Save it to our master results list

            # Now trim off the first character (otherwise we'll be matching
            # against the same substring (think 'deep recursion'), and call
            # this subroutine again for recursively generated sub-matches.
            # Whatever we get (if anything) is added to the list.
            #
            $result =~ s/^.//;
            my $psubstr = findMatches($result, $plist);
            push @matches, @$psubstr;
        }
    }
    return \@matches;
}`

By the way, you're a Java programmer, right? I'm not being pejorative or anything (well, maybe a little :-), but I had never heard of a "reluctant qualifier", and when I googled for it, I got mostly Java matches. That plus the lowerCaseMixedWithUpperCase variable names kinda gave you away ... ;->
Re^2: Regex Subexpressions by BenjiSmith (Novice) on Sep 10, 2005 at 00:49 UTC
Yep, I'm a Java programmer. I've been programming in Java for about five years, but I've also been writing Perl for the last three years. I just picked up Python a few weeks ago, and I really like it a lot. I've also done brief stints with VBA, D, C++, and x86 Assembly. But I know that--when I have a difficult regex problem--the people at perlmonks are more likely to be able to figure out a solution than anyone else.
Re^3: Regex Subexpressions by merlyn (Sage) on Sep 10, 2005 at 01:05 UTC
"But I know that--when I have a difficult regex problem--the people at perlmonks are more likely to be able to figure out a solution than anyone else."

I consider that a misuse, perhaps even an abuse, of the openness here. That'd be like saying "I know that most people here probably have Apple computers, so I'll post my Apple questions here". Seeing questions about Java, with Java's restrictions (even if it is regex), is not the reason I come here. Your task was simple with Perl's regex. Java's regex are different, and thus the answers weren't relevant, but you didn't disclose that at the beginning. Please mark your questions "off topic" next time, so I can ignore them.

-- Randal L. Schwartz, Perl hacker
Be sure to read my standard disclaimer if this is a reply.
Re: Regex Subexpressions by Roy Johnson (Monsignor) on Sep 10, 2005 at 02:45 UTC
Are you looking for this?

`@matches = m/(((a)b)c)/ and print "@matches\n";`

Caution: Contents may have been coded under pressure.
Re^2: Regex Subexpressions by liverpole (Monsignor) on Sep 10, 2005 at 03:02 UTC
Wow! Now that's a great answer! Let me just add the following, since it took me a few minutes to see exactly what you were doing, as well as making it conform to the original requirements for the output:

`$_ = "abc";  # Regex match of /.../ uses $_ when =~ is unspecified

# Nicely solves the problem stated, and demonstrates list assignment
# to a regex capture.
#
@matches = m/(((a)b)c)/ and print join "\n", reverse @matches;`

I realized a "reverse" was necessary, and then immediately fell into the trap of putting it before the "join". I like your answer a lot for its succinct approach!
Re^3: Regex Subexpressions by ikegami (Patriarch) on Sep 10, 2005 at 03:52 UTC
Does your solution have to find "ab" in "abd"? If so, the above won't work.
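A small sketch of that limitation, assuming the nested-capture pattern `/(((a)b)c)/` from the posts above: every group sits inside the outermost one, so nothing is captured unless the full "abc" is present. The sub name `nested_prefixes` is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the nested captures, shortest first,
# or an empty list when the pattern as a whole fails to match.
sub nested_prefixes {
    my ($text) = @_;
    my @captures = $text =~ /(((a)b)c)/;    # list-context match returns the captures
    my @shortest_first = reverse @captures;
    return @shortest_first;
}

my @from_abc = nested_prefixes("abc");
my @from_abd = nested_prefixes("abd");
print "abc: @from_abc\n";                          # prints "abc: a ab abc"
print "abd: ", scalar(@from_abd), " matches\n";    # prints "abd: 0 matches"
```

The "ab" in "abd" is never reported because the `c` required by the outer group is missing, and the whole match fails.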
Re: Regex Subexpressions by ikegami (Patriarch) on Sep 10, 2005 at 03:48 UTC
Do you have to use just one regexp? The following approach will work in Java, and each regexp is only compiled once:

`my @regexps = (
    qr/a/,
    qr/ab/,
    qr/abc/,
);
my $text = 'aababc';
my @matches;
foreach ($text) {  # Safe $_ = $text;
    foreach my $re (@regexps) {
        push(@matches, /($re)/g);
    }
}
print("$_\n") foreach @matches;`

`a   (at pos 0)
a   (at pos 1)
a   (at pos 3)
ab  (at pos 1)
ab  (at pos 3)
abc (at pos 3)`

Remove the "g" from "/($re)/g" to match each regexp only once.
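The code above prints only the matched strings, and the positions in the listing are annotations. A rough sketch of how the offsets could be collected as well, using Perl's `@-` and `@+` match-offset arrays, follows. The sub name `matches_with_positions` is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [matched_text, start_offset] pairs,
# grouped by regexp in the order the regexps were given.
sub matches_with_positions {
    my ($text, @regexps) = @_;
    my @hits;
    for my $re (@regexps) {
        # pos($text) resets after each failed match, so every
        # regexp scans the text from the beginning.
        while ($text =~ /$re/g) {
            push @hits, [ substr($text, $-[0], $+[0] - $-[0]), $-[0] ];
        }
    }
    return @hits;
}

my @regexps = (qr/a/, qr/ab/, qr/abc/);
printf "%s (at pos %d)\n", @$_ for matches_with_positions('aababc', @regexps);
```

Note that each regexp still finds only non-overlapping occurrences of its own keyword; overlap between different keywords is handled because every keyword gets its own pass over the text.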