Let's translate that to an intelligible question:
Hi all.
The following line in my program is supposed to determine if abc10 appears in a line of text.
my @find = $line[$i] =~ /\tabc[10]\t/g;
I also wanted to check how many times abc1, abc2, abc3, etc. appear in my text. This worked up through abc9, but when I get to double-digit numbers such as abc10 it returns just abc1 again.
How should I go about improving this regular expression to match terms that contain multiple numeric digits following the anchoring substring?
You've got a couple of problems. First, "worked well till i did it for abc9" cannot be true, because [10] is a character class that matches exactly one character, either a 1 or a 0. Your regular expression won't match any "abc" sequence unless it is followed by a single 0 or 1 and then a tab. So it's failing for digits 2 .. 9, and for abc10 as well. Let's fix the single-digit case first:
my @find = $line[$i] =~ /\tabc\d\t/g;
The \d metacharacter matches the digits 0 through 9, whereas your existing character class matches only zero and one.
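To see the difference side by side, here is a minimal sketch. The sample string is made up for illustration (it is not from the original post); each field is wrapped in its own pair of tabs so that both patterns get a fair chance at every field.

```perl
use strict;
use warnings;

# Hypothetical sample line: four fields, each with its own leading and trailing tab.
my $text = "\tabc0\t\tabc1\t\tabc5\t\tabc9\t";

# [10] is a character class: exactly one character, either '1' or '0'.
my @class_hits = $text =~ /\tabc[10]\t/g;

# \d matches any single digit.
my @digit_hits = $text =~ /\tabc\d\t/g;

printf "[10] matched %d field(s)\n", scalar @class_hits;
printf "\\d   matched %d field(s)\n", scalar @digit_hits;
```

With this input the character class finds only abc0 and abc1, while \d finds all four fields.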
Now the second problem: accepting more than one digit following the abc anchor:
my @find = $line[$i] =~ /\tabc\d+\t/g;
The + is called a quantifier, and it tells the regular expression engine that it should match "one or more" occurrences of whatever entity precedes it. In this case, that entity is "\d" (a numeric digit). So the two together tell the RE engine to match one or more numeric digits.
Let's test it out. ...seems to work.
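A test along these lines might look like the sketch below. The contents of @line are hypothetical, since the poster's real data isn't shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical sample data; the poster's real @line contents are not shown.
my @line = ("x\tabc1\ty", "x\tabc10\ty", "x\tabc123\ty", "x\tabcd\ty");

my @counts;
for my $i (0 .. $#line) {
    # One or more digits after "abc", with a tab on each side.
    my @find = $line[$i] =~ /\tabc\d+\t/g;
    push @counts, scalar @find;

    (my $shown = $line[$i]) =~ s/\t/\\t/g;    # make the tabs visible when printing
    printf "%-16s => %d match(es)\n", $shown, scalar @find;
}
```

The first three lines each produce one match, and the last one (abcd, no digits) produces none.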
But you still may have a problem. Are you sure that abc\d+ will always be preceded and followed by a tab character? What if it's at the start of the string, or at the end of the string? Would you want those cases to fail to match? (Maybe you would, but it may simply be a case that hasn't been considered.) Let's modify the regex to deal with that scenario:
my @find = $line[$i] =~ /(?:\A|\t)abc\d+(?:\t|\z)/g;
And let's test that out too. What we've done here is give the regular expression engine some alternatives to choose from, using the | (alternation) operator inside non-capturing (?: ) groups. On the left-hand side, we're saying it's OK to match either the beginning of the string (\A) or a tab character. On the right-hand side, we're saying it's OK to match either another tab or the end of the string (\z).
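Here is a small sketch comparing the tab-only pattern with the anchored alternation. The sample string is an assumption, chosen so that one term sits at the very start and another at the very end.

```perl
use strict;
use warnings;

# Hypothetical sample: abc1 starts the string, abc333 ends it.
my $text = "abc1\tfoo\tabc22\tbar\tabc333";

# Requires a literal tab on both sides.
my @strict = $text =~ /\tabc\d+\t/g;

# Accepts start-of-string on the left and end-of-string on the right.
my @loose = $text =~ /(?:\A|\t)abc\d+(?:\t|\z)/g;

printf "tabs only:      %d match(es)\n", scalar @strict;
printf "with \\A and \\z: %d match(es)\n", scalar @loose;
```

The tab-only pattern finds just abc22 here, while the second pattern also picks up the terms at the two ends of the string.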
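But are we done? I don't think so. What if "abc3" and "abc4" come right next to each other, separated only by a tab? The match for abc3 consumes that tab, so it is no longer available to serve as the leading tab for abc4, and abc4 gets skipped. We probably ought to convert the tab requirement on the right-hand side to a zero-width assertion (a lookahead), which checks for the tab without consuming it: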
/(?:\A|\t)abc\d+(?=\t|\z)/g
But then there are those pesky tab characters; the leading ones are still being returned as part of each match, and you probably don't need them. You may start thinking, surely there must be an easy way to just get rid of them in the output. And you would be right. All you have to do is add capturing parentheses around the part of the pattern that you're actually interested in. In list context, a match with /g then returns only the captured text. Here's how you would do it:
my @find = $line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;
We'll go ahead and try that too.
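As a rough sketch of that test, the snippet below runs the consuming version and the lookahead-plus-capture version against the same string. The input is hypothetical: three terms separated by single tabs.

```perl
use strict;
use warnings;

# Hypothetical sample: adjacent terms separated by a single tab each.
my $text = "abc3\tabc4\tabc10";

# Trailing tab is consumed, so it cannot also start the next match.
my @consuming = $text =~ /(?:\A|\t)abc\d+(?:\t|\z)/g;

# Trailing tab is only looked at, and the parentheses capture just the term.
my @lookahead = $text =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;

printf "consuming: %d match(es)\n", scalar @consuming;
printf "lookahead: %d match(es): %s\n", scalar @lookahead, join(", ", @lookahead);
```

The consuming pattern misses abc4 because the tab after abc3 has already been used up. The lookahead version finds all three terms, and the captures contain no tabs.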
But your question stated you're only interested in counting how many times a match occurred. For that, you can just get a count of @find by evaluating it in scalar context. ...at least that's one way to do it.
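A short sketch of the counting step follows, again with made-up input. It shows the scalar-context count, the common "= () =" idiom that counts without keeping the array, and a per-term tally in a hash (the %seen name is just for illustration).

```perl
use strict;
use warnings;

# Hypothetical sample line; abc2 appears twice.
my $line = "abc1\tabc2\tabc10\tabc2";

my @find = $line =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;

# An array evaluated in scalar context yields its element count.
my $count = scalar @find;

# Alternative: count the matches without storing them.
my $count_only = () = $line =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;

# Per-term tallies, in case separate counts for abc1, abc2, ... are wanted.
my %seen;
$seen{$_}++ for @find;

print "total: $count (also $count_only)\n";
print "$_: $seen{$_}\n" for sort keys %seen;
```

The hash is the piece to reach for if the goal is a separate count for each of abc1, abc2, and so on, as opposed to one overall total.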
perlrequick and perlretut are great places to start in learning to use regular expressions. perlre and perlrecharclass get you into more detail. And Jeffrey Friedl's book, "Mastering Regular Expressions", will attempt to teach you more than you ever thought possible about regular expressions (whether you actually learn it is up to you, but the book is the best resource available to help you build your skills).
Update: Added /g as mentioned by Jenda (It was included in the tests, but inadvertently left off of the code examples shown here.)
Dave
In reply to Re: problem with regex
by davido
in thread problem with regex
by Anonymous Monk