in reply to how to extract string by possible groupings?
There are six things obviously wrong with your regex:
Regarding grouping and capturing, remember that every pair of parens inside a regex creates a capturing group, and captured substrings are returned in order of appearance (added: as LanX++ beautifully illustrated). Consider the following snippet:
$string = "foo bar";
@match = $string =~ m/(f(oo)) (b(ar))/;
print "$match[0]\n"; # prints "foo" (captured by /(f(oo))/)
print "$match[1]\n"; # prints "oo"  (captured by /(oo)/)
print "$match[2]\n"; # prints "bar" (captured by /(b(ar))/)
print "$match[3]\n"; # prints "ar"  (captured by /(ar)/)
Likewise, you seem to think that your @match variable will contain three elements, but as a matter of fact it will contain 8 (eight!): one for every pair of parens in your regex, some of which only surround non-data such as the word "of" or just whitespace \s+.
Don't believe me? Do me a favour and run this snippet (in which I only fixed the \s vs \s+ issue)
use Data::Dumper;
while (chomp(my $line = <DATA>)) {
    @match = $line =~ m/((.*\.c\s+)|(.*\.h\s+)|(.*\.cpp\s+))|(\s+(.*) +\%\s+(of)\s+\d+\s)|(\bNone\b)/;
    print "$line\n";
    print Dumper \@match;
}
__DATA__
Title Percent2 Percent3
test1.cpp 0.00% of 21 0.00% of 16
test2.c None 16.53% of 484
test3.h 0.00% of 138 None
The output I get:
[... snip ...]
test1.cpp 0.00% of 21 0.00% of 16
$VAR1 = [
          'test1.cpp ',
          undef,
          undef,
          'test1.cpp ',
          undef,
          undef,
          undef,
          undef
[... snip ...]
This neatly demonstrates at least three things:
As for the DRY principle, you violate it, for example, in the chunk of the regex that tries to capture the file names. What you have written is: "match any number of characters, a literal period, a literal 'c', whitespace; OR match any number of characters, a literal period, a literal 'cpp', whitespace; OR match any (...)" I'm sure you get the pattern.
The way I would have written it reads as: "match any number of characters, a literal period, one of these literal strings ('c', 'cpp', 'h'), whitespace."
/(.*\.(?:c|cpp|h))\s+/ # Use (?:...) to create a non-capturing group.
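To see what the non-capturing group buys you, here is a minimal sketch that runs the same pattern twice against one of your sample lines, once with a plain (capturing) alternation and once with (?:...). The variable names are mine, made up for illustration.

```perl
use strict;
use warnings;

my $line = "test1.cpp 0.00% of 21 0.00% of 16";

# Capturing alternation: the extension gets its own slot in the result list.
my @with_capture = $line =~ /(.*\.(c|cpp|h))\s+/;

# Non-capturing alternation: only the file name is returned.
my @without_capture = $line =~ /(.*\.(?:c|cpp|h))\s+/;

print scalar(@with_capture), " captures: @with_capture\n";
print scalar(@without_capture), " capture: @without_capture\n";
```

With the capturing version you would have to remember that the second element is just the extension, which is exactly the kind of bookkeeping the non-capturing group spares you.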
The readability of your script could use some work too. Here's how I would've written it:
# I always start my script with these two lines.
# They prevent you from making various mistakes
# and make debugging a whole lot easier.
use strict;
use warnings;

# Regular expressions have the tendency to become long
# strings of near-undecipherable line noise. To avoid
# that, I usually like to split them up in smaller
# logical chunks.
# In this case, I'd write one regex to capture the
# file names and one regex to capture percentages.
my $title_re   = qr/.*\.(?:c|cpp|h)/;
my $percent_re = qr/(?:\d+\.\d+% of \d+|None)/;

# Next thing is to combine them into a single
# regex to match the input against.
# I use the /x modifier so that I can use
# white space and comments inside the regex.
my $line_re = qr/
    ($title_re)   \s+   # Match and capture file names, match whitespace
    ($percent_re) \s+   # Match and capture Percent2, match non-data
    ($percent_re)       # Match and capture Percent3
/x;

<DATA>; # Read and discard the first line, as this contains non-data.

# Read input line by line, cut off newline
# characters from the end.
while (my $line = <DATA>) {
    chomp $line;

    # Match input against the regex, capture
    # the stuff into separate variables.
    # I mean, I find a "$title" much more
    # comprehensible than "$match[0]".
    my ($title, $percent2, $percent3) = $line =~ $line_re;

    print "$line\n";
    print "Title: $title\n";
    print "Percent2: $percent2\n";
    print "Percent3: $percent3\n";
    print "\n";
}
__DATA__
Title Percent2 Percent3
test1.cpp 0.00% of 21 0.00% of 16
test2.c None 16.53% of 484
test3.h 0.00% of 138 None
C:\Users\Lona\Desktop>perl x.pl
test1.cpp 0.00% of 21 0.00% of 16
Title: test1.cpp
Percent2: 0.00% of 21
Percent3: 0.00% of 16

test2.c None 16.53% of 484
Title: test2.c
Percent2: None
Percent3: 16.53% of 484

test3.h 0.00% of 138 None
Title: test3.h
Percent2: 0.00% of 138
Percent3: None
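One thing the script above does not do is check whether the match succeeded; it relies on throwing away the header line up front. If your real input may contain other lines that don't fit the pattern, something along these lines might be a safer starting point. This is only a sketch: the @lines and @parsed arrays are stand-ins I made up so the example runs without a __DATA__ section.

```perl
use strict;
use warnings;

my $title_re   = qr/.*\.(?:c|cpp|h)/;
my $percent_re = qr/(?:\d+\.\d+% of \d+|None)/;
my $line_re    = qr/ ($title_re) \s+ ($percent_re) \s+ ($percent_re) /x;

# Hypothetical input: the header line plus one data line.
my @lines = (
    "Title Percent2 Percent3",
    "test2.c None 16.53% of 484",
);

my @parsed;
for my $line (@lines) {
    # A failed match returns an empty list in list context, so the
    # list assignment evaluates to 0 (false) when used as a condition.
    if (my ($title, $percent2, $percent3) = $line =~ $line_re) {
        push @parsed, [$title, $percent2, $percent3];
        print "Title: $title, Percent2: $percent2, Percent3: $percent3\n";
    }
    else {
        print "Skipped: $line\n";
    }
}
```

Here the header line is skipped because it contains no file name, so the explicit read-and-discard is no longer needed, and warnings about uninitialized values on odd lines are avoided.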