multiple matches in regex

ww has asked for the wisdom of the Perl Monks concerning the following question:

This grows out an an attempt to improve my fu by building an alternate solution (without using the experimental constructs offered in answers to Need Regex help...but now, my head is bloodied (from banging it against what's become a stone wall). Pray help me to an answer that causes me to slap and flatten that bloody forehead (as in "Oh.... yeah!!!!! How could I have missed that?").

TEST1 is supposed to match entries with BOTH EXT and INT curly delimiters update: so it SHOULD capture 2 matches in Dataline 1 but (per testing and hash contents) it is capturing only the first match in DataLine 1 -- SEE note 2. It correctly captures a single match in DataLine2

NOTE 1: regex in extended format & commented is update in the readmore (was referenced to ww's scratchpad)

NOTE 2: IF this is modified with an initial .* before the capture, the "{three 1 1 {...} {...}}" matches

#!C:/perl/bin -w
use strict;
use Data::Dumper::Simple;
use vars qw ( $i $j $tmp $test @data @found $found %h );
 
$i = 1;     # tracking counter - in full version does NOT necessarily 
+== $j
$j = 0;     # track datalines
%h = (); 
 
while (<DATA>)
   {
      push @data,$_ ;
   }

foreach $test(@data)
   { 
      test1($test);
   } 
 
&hashprint;
exit();
################ subs #############
 
sub test1 {
                  # first, get rid of previous array conents (needed i
+n full vers)
undef(@found);    # Categorized Q&A book answer; "@found = ();" is an 
+alternate
$j++; 
print "\n\t ***** STARTING dataline $j ****\n\n";
 
if ($test =~ /(\{{1}[a-z]{4,5}\s\d\s\d\s[^}]+?\}\s\{[^}]+?\}{2})/gx ) 
+ # see NOTE1
      {
        if ( $1 )
           { 
            $tmp = "\t\$j" . $j . "\$i". $i . ": " . $1 . "\n";
            push @found, $tmp;
            &hashify;
 
            print "TEST1 regex, dataline $j found match(es)\n";
            print "\t $1\n";
           }
           else 
           {
           print "\t TEST1 dataline $j, NO matches.\n";
           }
      } 
$i++;
return($i)
}
# end sub test1
# (more similar tests here in full version)
##############
 
sub hashify { 
$h {'datalinej'.$j."_i".$i} {"j".$j."i" .$i} = "@found"; # get all @fo
+und's elements into hash ???
$tmp = '';
return();
} 
# end hashify
###########
 
sub hashprint { 
print "\n\t\t~~~~~~~~~~~~~~~~ Dumping hash (HoA)\n";
print Dumper (%h);
print "\t\t~~~~~~~~~~~~~~~~ ending dump\n";
} 
# end sub hashprint and end subs
__DATA__
DataLine1 {error 1 1 {^E 0-20-9:0-50-9:0-50-9.*$} {^E 0-20-9:0-50-9:0-
+50-9.*$}} {three 1 1 {^.*35=A.*$|^.*35=5.*$} {^.*35=A.*$|^.*35=5.*$}}
+ {fixv 1 1 ^.*VFIXFxProxy.*Disconnected ^.*VFIXFxProxy.*Disconnected}
DataLine2 error 1 1 {^E 0-20-9:0-50-9:0-50-9.*$} {^E 0-20-9:0-50-9:0-5
+0-9.*$} {three 1 1 {^.*35=A.*$|^.*35=5.*$} {^.*35=A.*$|^.*35=5.*$}} {
+fixv 1 1 ^.*VFIXFxProxy.*Disconnected ^.*VFIXFxProxy.*Disconnected}
[download]

update Output is:

         ***** STARTING dataline 1 ****

TEST1 regex, dataline 1 found match(es)
         {error 1 1 {^E 0-20-9:0-50-9:0-50-9.*$} {^E 0-20-9:0-50-9:0-5
+0-9.*$}}

         ***** STARTING dataline 2 ****

TEST1 regex, dataline 2 found match(es)
         {three 1 1 {^.*35=A.*$|^.*35=5.*$} {^.*35=A.*$|^.*35=5.*$}}

                ~~~~~~~~~~~~~~~~ Dumping hash (HoA)
%h = (
       'datalinej1_i1' => {
                            'j1i1' => ' $j1$i1: {error 1 1 {^E 0-20-9:
+0-50-9:0-50-9.*$} {^E 0-20-9:0-50-9:0-50-9.*$}}
'
                          },
       'datalinej2_i2' => {
                            'j2i2' => ' $j2$i2: {three 1 1 {^.*35=A.*$
+|^.*35=5.*$} {^.*35=A.*$|^.*35=5.*$}}
'
                          }
     );
                ~~~~~~~~~~~~~~~~ ending dump
[download]

and, update per merlyn's note, the commented regex for TEST1 is:

$test =~ /
      (                 # capture
      \{{1}             # SINGLE open curlybrace
      [a-z]{4,5}        # followed by 4 or 5 lowercase letters
      \s\d\s\d\s        # fol by whitespace, digit, whitespace, digit,
+ space
      [^}]+?            # fol by anything_not_a_closingcurly, one-or-m
+ore_times-but-as_few_as_poss
      \}\s\{            # closingcurly fol by whitespace fol by opencu
+rly (tween element 4 and element 5)
      [^}]+?            # fol by anything_not_a_closingcurly, one-or-m
+ore_times-but-as_few_as_poss
      \}{2}             # closingcurly twice
      )/gx;            # end capture, global, xtended syntax, end if_t
+est
[download]

update (approx 0200 UTC, 2005-09-17) And if you've read this far; thanks. I'm now trying to wrap my head around ikegami's and util's replies... after which I'll update again.
Comment on multiple matches in regex
Select or Download Code

Replies are listed 'Best First'.
Re: multiple matches in regex by ikegami (Patriarch) on Sep 16, 2005 at 20:10 UTC
You could always use a parser: `$grammar = <<'__END_OF_GRAMMAR__'; { use strict; use warnings; $skip = qr/[ \t]/; } # Returns a ref to an array of lines. parse : line(s?) /$/ { $item[1] } # Returns a ref to an array of atoms and lists. line : term(s?) /\n/ { $item[1] } # Returns an atom or a list. # An atom looks like: [ atom => "text" ] # A list is a ref to an array of atoms and/or lists. term : '{' term(s?) '}' { [ list => $item[2] ] } \| /[^{} \t\n]+/ { [ atom => $item[1] ] } __END_OF_GRAMMAR__` [download] Update: Added missing closing square brackets. Update: Tested: Read more... (4 kB) Note*: It's quite easy to modify the grammar to generate data in a different format. Let me know if you need help to do so.	[reply] [d/l] [select]
Re: multiple matches in regex by jdporter (Paladin) on Sep 16, 2005 at 20:40 UTC
I think you didn't quite get my point about reducing your code to the smallest possible program that illustrates the problem. You've got so much extraneous data (and code) that it's really hard to see what might be going wrong. It's also not clear what result you're expecting, and you didn't show what result you got. When I run your code, it tells me there are two matches, and your description seems to imply that you expect two... so I'm not even sure you have a problem. (Aside: If you're going to `use vars` all your variables, and call your subroutines with `&`, you may as well not bother with `use strict`.)	[reply] [d/l] [select]
Re: multiple matches in regex by Util (Priest) on Sep 17, 2005 at 00:43 UTC
Here is your question, rephrased (and trimmed down) as best I could decipher. Please clarify if I misunderstood, because this question is what I am answering below. my @data = ( 'Line1 <error 1 1 <foo> <bar>> <three 1 1 <baz> <qux>> <fixv 1 1 qib + qeg>', 'Line2 error 1 1 <foo> <bar> <three 1 1 <baz> <qux>> <fixv 1 1 qib q +eg>', ); # The single match on Line2 is correct, # but I am only getting one match from Line1. # When I work the pattern by hand, # I see that it should match twice on Line2. # How can I get the second match? foreach my $test (@data) { if ($test =~ /(<[a-z]{4,5}\s\d\s\d\s<[^>]+?>\s<[^>]+?>>)/gx ) { print "Found '$1' in line '$test'\n"; } } [download] When you want to capture separate instances of the match multiple times in the same line you use the `/g` modifier, and either: loop over the match in scalar context, like `while ( m/(\d+)/g ) {print $1};` or capture all the matches at once in list context (to be looped over later), like `@matches = m/(\d+)/g;` . You cannot refer to the second match as `$2`, because you only have one set of capturing parens in the pattern! For a simple example, see the section "More matching" in the Perl regular expressions quick start document. Working, tested code: use strict; use warnings; # Set this to 0 or 1; # the output will be exactly the same either way. my $use_scalar_method = 1; # I added the first "open inner brace", # so that the indentation would balance. my $pattern = qr/ \{ # open outer brace [a-z]{4,5} # 4 or 5 lowercase letters \s # \d # single digit \s # \d # single digit \s # \{ # open inner brace [^}]+? # everything up to the close brace \} # close inner brace \s # \{ # open inner brace [^}]+? # everything up to the close brace \} # close inner brace \} # close outer brace /x; my @data = ( 'DataLine1 {error 1 1 {^E 0-20-9:0-50-9:0-50-9.$} {^E 0-20-9:0-50-9 +:0-50-9.$}} {three 1 1 {^.35=A.$\|^.35=5.$} {^.35=A.$\|^.35=5. +$}} {fixv 1 1 ^.VFIXFxProxy.Disconnected ^.VFIXFxProxy.Disconnect +ed}', 'DataLine2 error 1 1 {^E 0-20-9:0-50-9:0-50-9.$} {^E 0-20-9:0-50-9: +0-50-9.$} {three 1 1 {^.35=A.$\|^.35=5.$} {^.35=A.$\|^.35=5.$} +} {fixv 1 1 ^.VFIXFxProxy.Disconnected ^.VFIXFxProxy.Disconnected +}', 'DataLine3 error 1 1 {^E 0-20-9:0-50-9:0-50-9.$} {^E 0-20-9:0-50-9: +0-50-9.$} {twentythreeistoobig 1 1 {^.35=A.$\|^.35=5.$} {^.35=A. +$\|^.35=5.$}} {fixv 1 1 ^.VFIXFxProxy.Disconnected ^.VFIXFxProxy +.Disconnected}', ); my $line_number = 0; foreach my $test (@data) { $line_number++; print "\nLine: $line_number\n"; if ($use_scalar_method) { my $match_number = 0; # Scalar context: # You can get the matches one-at-a-time this way. while ( $test =~ /($pattern)/g ) { $match_number++; print "Match $match_number: $1\n"; } print "No matches.\n" unless $match_number; } else { # List context: # You can get all the matches at once this way. my @matches = ($test =~ /($pattern)/g) or print "No matches.\n" and next; my $match_number = 0; foreach my $match (@matches) { $match_number++; print "Match $match_number: $match\n"; } } } [download]	[reply] [d/l] [select]