in reply to problem with regex

Let's translate that to an intelligible question:

Hi all.

The following line in my program is supposed to determine if abc10 appears in a line of text.

my @find = $line[$i] =~ /\tabc[10]\t/g;

I wanted to also check how many times abc1, abc2, abc3, etc. appear in my text. This worked up through abc9, but when I get to double-digit numbers such as abc10 it returns just abc1 again.

How should I go about improving this regular expression to match terms that contain multiple numeric digits following the anchoring substring?

You've got a couple of problems. First, "worked well till i did it for abc9" cannot be true, because your regular expression won't match any "abc" sequence unless it is followed by a 0 or a 1. The character class [10] matches exactly one character, either a 1 or a 0; it does not mean the number ten. So it's failing for digits 2 .. 9. Let's fix that first:

my @find = $line[$i] =~ /\tabc\d\t/g;

The \d metacharacter matches any single digit (0 through 9 in plain ASCII text), whereas your existing character class matches only a zero or a one.
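Here's a quick way to see the difference for yourself. This is only a sketch: the sample fields are made up for illustration and are not from your data.

```perl
use strict;
use warnings;

# Hypothetical test fields abc0 .. abc9, each wrapped in tabs
# the way the original pattern expects.
my @samples = map { "\tabc$_\t" } 0 .. 9;

# Character class [10]: a single character that is either 1 or 0.
my @class_hits = grep { /\tabc[10]\t/ } @samples;

# \d: any single digit.
my @digit_hits = grep { /\tabc\d\t/ } @samples;

printf "[10] matched %d of %d samples\n", scalar @class_hits, scalar @samples;
printf "\\d   matched %d of %d samples\n", scalar @digit_hits, scalar @samples;
```

With these samples the character class only finds abc0 and abc1, while \d finds all ten.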

Now the second problem: Accepting more than one digit following the abc anchor:

my @find = $line[$i] =~ /\tabc\d+\t/g;

The + is called a quantifier, and it tells the regular expression engine that it should match "one or more" occurrences of whatever entity precedes it. In this case, that entity is "\d" (a numeric digit). So the two together tell the RE engine to match one or more numeric digits.

Let's test it out. ...seems to work.
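If you'd like to reproduce that kind of test, a minimal sketch might look like this. The input line is hypothetical; substitute a line from your own file.

```perl
use strict;
use warnings;

# Hypothetical input line; the field values are invented for illustration.
my $line = "x\tabc10\ty\tabc123\tz";

my @one_digit  = $line =~ /\tabc\d\t/g;    # exactly one digit after abc
my @many_digit = $line =~ /\tabc\d+\t/g;   # one or more digits after abc

print scalar(@one_digit),  " match(es) with \\d\n";
print scalar(@many_digit), " match(es) with \\d+\n";
```

Neither field has exactly one digit, so the \d version finds nothing here, while the \d+ version finds both abc10 and abc123.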

But you still may have a problem. Are you sure that abc\d+ will always be preceded and followed by a tab character? What if it's at the start of the string, or at the end? Would you want those cases to fail? (Maybe you would, but it may simply be a case that hasn't been considered.) Let's modify the regex to deal with that scenario:

my @find = $line[$i] =~ /(?:\A|\t)abc\d+(?:\t|\z)/g;

And let's test that out too. What we've done here is give the regular expression engine some alternatives to choose from. On the left-hand side, the | (alternation) operator says it's OK to match either the beginning of the string or a tab character. On the right-hand side, it says it's OK to match either another tab or the end of the string.
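A small sketch of what the anchoring alternation buys you, using a made-up line where the interesting fields sit at the very start and very end:

```perl
use strict;
use warnings;

# Hypothetical line: abc1 is at the start, abc22 is at the end.
my $line = "abc1\tfoo\tabc22";

# Tabs required on both sides: neither field qualifies.
my @tab_only = $line =~ /\tabc\d+\t/g;

# Start-of-string or tab on the left, tab or end-of-string on the right.
my @anchored = $line =~ /(?:\A|\t)abc\d+(?:\t|\z)/g;

print scalar(@tab_only), " match(es) when tabs are required\n";
print scalar(@anchored), " match(es) with \\A and \\z allowed\n";
```

The tab-only pattern misses both fields, and the version with \A and \z picks up both.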

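But are we done? I don't think so. What if "abc3" and "abc4" come right next to each other, separated only by a tab? The tab that ends the first match is consumed by that match, so it is no longer available to begin the second one, and "abc4" gets skipped. We probably ought to convert the tab requirement on the right-hand side to a zero-width assertion (a lookahead), which checks for the tab without consuming it: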

/(?:\A|\t)abc\d+(?=\t|\z)/g
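To make the adjacency problem concrete, here is a sketch comparing the consuming version with the lookahead version on a hypothetical line of three neighboring fields:

```perl
use strict;
use warnings;

# Hypothetical line: three matching fields separated only by single tabs.
my $line = "abc3\tabc4\tabc5";

# Trailing tab is consumed, so it cannot also start the next match.
my @consuming = $line =~ /(?:\A|\t)abc\d+(?:\t|\z)/g;

# Trailing tab is only looked at, so the next match can still begin with it.
my @lookahead = $line =~ /(?:\A|\t)abc\d+(?=\t|\z)/g;

print scalar(@consuming), " match(es) when the trailing tab is consumed\n";
print scalar(@lookahead), " match(es) with the lookahead\n";
```

The consuming pattern finds abc3 and abc5 but skips abc4 in the middle; the lookahead pattern finds all three.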

But then there's those pesky tab characters; the leading ones are still part of each match that gets returned, and you probably don't need them. You may start thinking, surely there must be an easy way to keep them out of the output. And you would be right. All you have to do is add capturing parentheses around the part of the pattern that you're actually interested in. When a pattern contains capture groups, a list-context match with /g returns the captured text rather than the whole match. Here's how you would do it:

my @find = $line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;

We'll go ahead and try that too.
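Something along these lines shows the effect of the capturing parentheses. Again, the input is hypothetical and just for illustration.

```perl
use strict;
use warnings;

# Hypothetical line of three tab-separated fields.
my $line = "abc3\tabc4\tabc10";

# No capture group: each element is the whole match, leading tab included.
my @whole = $line =~ /(?:\A|\t)abc\d+(?=\t|\z)/g;

# Capture group: each element is only the text inside the parentheses.
my @captured = $line =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;

# Count how many returned elements still start with a tab.
my $whole_with_tab    = grep { /^\t/ } @whole;
my $captured_with_tab = grep { /^\t/ } @captured;

print "without capture group: $whole_with_tab element(s) start with a tab\n";
print "with capture group:    $captured_with_tab element(s) start with a tab\n";
print "captured: @captured\n";
```

Without the capture group, two of the three elements carry a leading tab; with it, you get the bare abc3, abc4 and abc10.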

But your question stated you're only interested in counting how many times a match occurred. For that, you can get the number of elements in @find by evaluating it in scalar context. ...at least that's one way to do it.
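As a sketch, counting could look like either of the following. The sample line is invented, and the second form (assigning through an empty list) is a common Perl idiom that I'm offering as an alternative, not something from the original question.

```perl
use strict;
use warnings;

# Hypothetical line with three matching fields and one non-matching field.
my $line = "abc1\tabc10\tfoo\tabc2";

# Way 1: collect the matches, then use the array in scalar context.
my @find  = $line =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;
my $count = @find;    # an array in scalar context yields its element count

# Way 2: skip the named array; assigning to an empty list forces list
# context, and the scalar assignment then receives the number of matches.
my $direct = () = $line =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;

print "count via \@find: $count\n";
print "count via () =:  $direct\n";
```

Both report three matches for this line. The first form is handy when you also need the matched strings; the second avoids keeping them around.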

perlrequick and perlretut are great places to start learning to use regular expressions. perlre and perlrecharclass go into more detail. And Jeffrey Friedl's book, "Mastering Regular Expressions", will attempt to teach you more than you ever thought possible about regular expressions (whether you actually learn it is up to you, but the book is the best resource available to help you get there).

Update: Added /g as mentioned by Jenda (It was included in the tests, but inadvertently left off of the code examples shown here.)


Dave

Re^2: problem with regex
by Jenda (Abbot) on Jul 22, 2012 at 08:15 UTC

    There should be a "g" at the end of those lines:

    my @find = $line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;
    otherwise the regexp matches only once instead of searching for all abc<some_number>.

    And if you happen to need to count each of the abc<some_number> separately you can either process the @find:

    my @find = $line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;
    my %count;
    $count{$_}++ for @find;

    or skip building the array

    my %count;
    while ($line[$i] =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g) {
        $count{$1}++;
    }

    Jenda
    Enoch was right!
    Enjoy the last years of Rome.

      I fixed the /g modifiers. They were in my head, and in the tests I linked to, but failed to find their way through my fingers to the keyboard as I posted the node. ;) Thanks for the catch!


      Dave