problem with regex

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: problem with regex by davido (Cardinal) on Jul 21, 2012 at 22:30 UTC
Let's translate that to an intelligible question:

Hi all. The following line in my program is supposed to determine if `abc10` appears in a line of text. `my @find = $line[$i] =~ /\tabc[10]\t/g;` I wanted to also check how many times `abc1`, `abc2`, `abc3`, etc. appear in my text. This worked up through `abc9`, but when I get to double-digit numbers such as `abc10` it returns just `abc1` again. How should I go about improving this regular expression to match terms that contain multiple numeric digits following the anchoring substring?

You've got a couple of problems. First, "worked well till i did it for abc9" cannot be true, because your regular expression won't match any `abc` sequence unless it is followed by a `0` or a `1`. So it's failing for digits `2` .. `9`. Let's fix that first: `my @find = $line[$i] =~ /\tabc\d\t/g;`

The `\d` metacharacter matches the digits 0 through 9, where your existing character class only matches zero and one.

Now the second problem: accepting more than one digit following the `abc` anchor: `my @find = $line[$i] =~ /\tabc\d+\t/g;`

The `+` is called a quantifier, and it tells the regular expression engine that it should match "one or more" occurrences of whatever entity precedes it. In this case, that entity is `\d` (a numeric digit). So the two together tell the RE engine to match one or more numeric digits. Let's test it out. ...seems to work.

But you still may have a problem. Are you sure that `abc\d+` will always be preceded and followed by a tab character? What if it's at the start of the string, or at the end of the string? Would you want those cases to fail? (Maybe you would, but it may simply be a case that hasn't been considered.) Let's modify the regex to deal with that scenario: `my @find = $line[$i] =~ /(?:\A|\t)abc\d+(?:\t|\z)/g;`

And let's test that out too.
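As a rough check of the progression so far, here is a minimal sketch. The sample `$line` is made up for illustration (it is not the original poster's data), with `abc10` deliberately placed at the very end so that it has no trailing tab:

```perl
use strict;
use warnings;

# Hypothetical sample: tab-separated fields, abc10 is the last field.
my $line = join "\t", 'x', 'abc1', 'y', 'abc7', 'z', 'abc10';

# [10] is a character class: exactly ONE character, either '1' or '0'.
my @class = $line =~ /\tabc[10]\t/g;

# \d matches any single digit.
my @digit = $line =~ /\tabc\d\t/g;

# \d+ matches one or more digits, but still insists on a trailing tab.
my @plus = $line =~ /\tabc\d+\t/g;

# Allow start-of-string on the left and end-of-string on the right.
my @edges = $line =~ /(?:\A|\t)abc\d+(?:\t|\z)/g;

printf "%-8s %d\n", '[10]',  scalar @class;   # 1 (abc1 only)
printf "%-8s %d\n", '\d',    scalar @digit;   # 2 (abc1, abc7)
printf "%-8s %d\n", '\d+',   scalar @plus;    # 2 (abc10 has no trailing tab)
printf "%-8s %d\n", 'edges', scalar @edges;   # 3 (abc1, abc7, abc10)
```

In list context, a `/g` match with no capture groups returns every matched substring, which is why the array sizes can be compared directly.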
What we've done here is give the regular expression engine some alternatives to choose from. On the left-hand side, we're saying it's OK to match either a tab character or the beginning of the string. On the right-hand side we're using the `|` (alternation) operator to tell Perl it's OK to match another tab, or the end of the string.

But are we done? I don't think so. What if "abc3" and "abc4" come right next to each other, separated only by a tab? The first match would consume that tab, leaving nothing for the left-hand side of the second match. We probably ought to convert the tab requirement on the right-hand side to a zero-width assertion: `/(?:\A|\t)abc\d+(?=\t|\z)/g`

But then there are those pesky tab characters; they're ending up in the matched text, and you probably don't need them. You may start thinking, surely there must be an easy way to just get rid of them in the output. And you would be right. All you have to do is add capturing parentheses around the part of the pattern that you're actually interested in. Here's how you would do it: `my @find = $line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;`

We'll go ahead and try that too.

But your question stated you're only interested in counting how many times a match occurred. For that, you can just get a count of `@find` by evaluating it in scalar context. ...at least that's one way to do it.

perlrequick and perlretut are great places to start in learning to use regular expressions. perlre and perlrecharclass get you into more detail. And Jeffrey Friedl's book, "Mastering Regular Expressions", will attempt to teach you more than you ever thought possible about regular expressions (whether you actually learn it is up to you, but the book is the best resource available for sharpening your aptitude).

Update: Added /g as mentioned by Jenda. (It was included in the tests, but inadvertently left off of the code examples shown here.)

Dave
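The adjacent-fields problem and the scalar-context count can be sketched as follows. This is only an illustration under an assumed input: the `$line` below is hypothetical and is built so that every field is a match, separated by single tabs.

```perl
use strict;
use warnings;

# Hypothetical line: four matching fields separated by single tabs.
my $line = join "\t", 'abc3', 'abc4', 'abc10', 'abc3';

# Consuming the trailing tab leaves no leading tab for the next field,
# so every second field is skipped.
my @consuming = $line =~ /(?:\A|\t)(abc\d+)(?:\t|\z)/g;

# The lookahead (?=...) only checks for a tab or end of string; it
# consumes nothing, so the tab is still there for the next match.
my @lookahead = $line =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;

# An array evaluated in scalar context yields its element count.
my $count = @lookahead;

print "consuming: @consuming\n";   # consuming: abc3 abc10
print "lookahead: @lookahead\n";   # lookahead: abc3 abc4 abc10 abc3
print "count:     $count\n";       # count:     4
```

Because the pattern contains one capture group, the list returned by the `/g` match holds only the captured `abc<digits>` text, without the surrounding tabs.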
Re^2: problem with regex by Jenda (Abbot) on Jul 22, 2012 at 08:15 UTC
There should be a "g" at the end of those lines: `my @find = $line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;` otherwise the regexp matches only once instead of searching for all abc<some_number>.

And if you happen to need to count each of the abc<some_number> separately, you can either process the @find: `my @find = $line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g; my %count; $count{$_}++ for @find;` or skip building the array: `while ($line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g) { $count{$1}++; }`

Jenda
Enoch was right! Enjoy the last years of Rome.
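A minimal, self-contained sketch of the per-token tally described above. The contents of `@line` are invented for illustration, and sorting the keys by their numeric suffix is an extra step added here for readable output, not something the post prescribes:

```perl
use strict;
use warnings;

# Hypothetical input: two tab-separated lines.
my @line = (
    "abc1\tfoo\tabc10\tabc1",
    "abc2\tabc10\tbar",
);

my %count;
for my $i (0 .. $#line) {
    # In scalar context, /g resumes where the previous match ended,
    # so the loop visits each occurrence once without building an array.
    while ($line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g) {
        $count{$1}++;
    }
}

# Every key is "abc" followed by digits, so compare the part after "abc".
for my $key (sort { substr($a, 3) <=> substr($b, 3) } keys %count) {
    print "$key: $count{$key}\n";
}
# abc1: 2
# abc2: 1
# abc10: 2
```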
Re^3: problem with regex by davido (Cardinal) on Jul 22, 2012 at 08:30 UTC
I fixed the `/g` modifiers. They were in my head, and in the tests I linked to, but failed to find their way through my fingers to the keyboard as I posted the node. ;) Thanks for the catch!

Dave
Re: problem with regex by ww (Archbishop) on Jul 22, 2012 at 00:52 UTC
davido's (almost) exhaustive answer ++ should give you plenty to chew on... for this and many other regex issues.

But I read your goal a bit differently -- based on your apparent failure to fully understand code tags (your last line needs them around abc10 to make it read as something other than a link). Since it is a link, I'm going to work on the assumption you really meant what you said, above, `abc10`.

... and davido's answer covers that... but if you want your regex to match a line which contains exactly `abc10` and nothing else, then you have this additional option (see the update below): ~~`=~ /^abc10/$`~~ (where "$" is another way to say EOL)

On the other hand, if you'd like to match `abc10`, `abc01` and `abc11`, you can use a character class (that's what a set of chars/digits inside square brackets is called): `#!/usr/bin/perl use 5.014; my @ar=('abc10', 'abc11', 'abc01', 'abc02', 'abc111'); for $_( @ar ) { if( $_ =~ /^abc[10]{2}/Z/ ) # wrong; see below { say $_; } }` where the "`^`" is another way to say 'match at the start of the string' and the curly-wrapped "2" is a quantifier -- match a 1 or a 0 and do so exactly 2 times. This little script spits out: `983013.pl abc10 abc11 abc01`

Update: Two bad screwups in one post. Thanks to AnomalousMonk for msg'ing me about them. The first code sample should have the EOL marker, `$`, inside the regex-terminating "/", i.e., `=~ /^abc10$/`. My bad! The second snippet, line seven, should use `\Z`, as it did in the code I actually tested... after posting the inaccurate version, i.e., `if( $_ =~ /^abc[10]{2}\Z/ )`. Apologies to any who tried to use the advice.
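For reference, a small sketch of the corrected, anchored character-class test. The extra `"abc10\n"` element is an assumption added here to show how `\Z` (which also matches just before a final newline) differs from the stricter `\z`; it was not part of the original test data:

```perl
use strict;
use warnings;

# The post's sample strings plus one hypothetical entry with a trailing newline.
my @ar = ('abc10', 'abc11', 'abc01', 'abc02', 'abc111', "abc10\n");

# \z matches only at the very end of the string.
my @strict = grep { /^abc[10]{2}\z/ } @ar;

# \Z also matches just before a newline that ends the string.
my @lenient = grep { /^abc[10]{2}\Z/ } @ar;

print scalar(@strict),  "\n";   # 3: abc10, abc11, abc01
print scalar(@lenient), "\n";   # 4: the same three plus "abc10\n"
```

`abc02` fails because `2` is not in the class, and `abc111` fails because a third digit sits between the two matched digits and the end-of-string anchor.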
Re: problem with regex by Kenosis (Priest) on Jul 22, 2012 at 01:06 UTC
davido did an excellent job detailing the step-by-step creation of a general regex that would count your `abc\d+` pattern within a line. However, I get the impression that you'd like a specific count of the number of occurrences for abc1 .. abc10 within a line.

davido indicated that your regex would only match abc followed by either a 1 or a 0, so it's not going to match all of abc10. Perhaps I'm over-simplifying the issue, but I'd likely use the following to count the number of times abc10 occurs within a line of text: `my @find = $line[$i] =~ /\babc10\b/g;`

This regex will match your abc10 only where it is delimited by word boundaries. Consider the following test for this regex: `use Modern::Perl; my @patterns = map "abc$_", 1 .. 10; my $str = join ' ', map { $_ = int( rand(10) ) + 1; "abc$_" } 1 .. 25; say "String:\n$str\n\nContains:"; say "$_: " . @{ [ $str =~ /\b$_\b/g ] } for @patterns;`

Output: `String: abc9 abc6 abc2 abc7 abc3 abc10 abc2 abc10 abc1 abc3 abc2 abc1 abc4 abc4 abc2 abc4 abc8 abc10 abc10 abc7 abc3 abc10 abc6 abc1 abc8 Contains: abc1: 3 abc2: 4 abc3: 3 abc4: 3 abc5: 0 abc6: 2 abc7: 2 abc8: 2 abc9: 1 abc10: 5`

The script first generates a string of 25 random instances of the patterns (abc1 .. abc10) you're interested in counting. It ends by looping through each pattern, using a regex like the one above to count the occurrences of each of abc1 .. abc10 within that string.

Hope this helps!
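To see why the word boundaries matter, here is a minimal sketch on a fixed, hypothetical string (fixed rather than random so the counts are reproducible). The single-pass tally at the end is an alternative added for comparison, not the approach used in the post above:

```perl
use strict;
use warnings;

my $str = 'abc9 abc10 abc1 abc10 abc1 abc2';   # hypothetical sample

# '1' and '0' are both word characters, so there is no \b between them:
# /\babc1\b/ cannot match the "abc1" inside "abc10".
my $abc1  = () = $str =~ /\babc1\b/g;
my $abc10 = () = $str =~ /\babc10\b/g;

# Without the boundaries, "abc1" also matches inside "abc10".
my $loose = () = $str =~ /abc1/g;

print "abc1:  $abc1\n";    # abc1:  2
print "abc10: $abc10\n";   # abc10: 2
print "loose: $loose\n";   # loose: 4

# One pass over the string, tallying every abc<digits> token.
my %count;
$count{$1}++ while $str =~ /\b(abc\d+)\b/g;
print "abc9 seen $count{abc9} time(s)\n";   # abc9 seen 1 time(s)
```

The `my $n = () = ...` form assigns the match list to an empty list and then takes that assignment in scalar context, which yields the number of matches without a temporary array.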