Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
Re: problem with regex
by davido (Cardinal) on Jul 21, 2012 at 22:30 UTC
Let's translate that to an intelligible question:
You've got a couple of problems. First, "worked well till i did it for abc9" cannot be true, because your regular expression won't match any "abc" sequence unless it is followed by a 0 or a 1. So it's failing for digits 2 .. 9. Let's fix that first:
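The snippet that followed here is not preserved in this copy of the thread. As a minimal sketch, assuming the asker's pattern looked something like /\tabc[01]\t/ (an assumption; the question's code isn't shown) and using a made-up sample line, swapping the character class for \d might look like this:

```perl
use strict;
use warnings;

# Hypothetical sample line; the asker's real data is not shown in the thread.
my $line = "x\tabc9\ty";

# Assumed shape of the original pattern: only a 0 or a 1 may follow "abc".
my $old = ( $line =~ /\tabc[01]\t/ ) ? 1 : 0;

# Same pattern with \d, which accepts any single digit 0 through 9.
my $new = ( $line =~ /\tabc\d\t/ ) ? 1 : 0;

print "old: $old, new: $new\n";    # prints "old: 0, new: 1"
```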
The \d metacharacter matches the digits 0 through 9, where your existing character class only matches zero and one. Now the second problem: Accepting more than one digit following the abc anchor:
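The snippet that followed here is also missing from this copy. A small sketch of the idea, again assuming a tab-delimited pattern and a hypothetical sample line, compares a single \d with \d+:

```perl
use strict;
use warnings;

# Hypothetical sample line with a two-digit suffix.
my $line = "x\tabc10\ty";

# A single \d matches "1", but then the pattern wants a tab and sees "0".
my $single = ( $line =~ /\tabc\d\t/ ) ? 1 : 0;

# \d+ matches one or more digits, so "abc10" is accepted.
my $plus = ( $line =~ /\tabc\d+\t/ ) ? 1 : 0;

print "single: $single, plus: $plus\n";    # prints "single: 0, plus: 1"
```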
The + is called a quantifier, and it tells the regular expression engine to match "one or more" occurrences of whatever entity precedes it. In this case, that entity is "\d" (a numeric digit), so the two together tell the RE engine to match one or more numeric digits. Let's test it out. ...seems to work.

But you may still have a problem. Are you sure that abc\d will always be preceded by a tab character? What if it's at the start of the string, or at the end of the string? Would you want those cases to fail? (Maybe you would, but it may simply be a case that hasn't been considered.) Let's modify the regex to deal with that scenario:
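The modified regex itself is not preserved here. One way it could be written, assuming \A and \z as the string anchors (consistent with the pattern quoted later in this reply) and using hypothetical test lines:

```perl
use strict;
use warnings;

# Hypothetical test lines: match at the start, at the end, in the middle,
# and one where "abc6" is glued to other text and should be rejected.
my @lines = ( "abc3\tfoo", "foo\tabc4", "foo\tabc5\tbar", "fooabc6\tbar" );

# Allow either start-of-string or a tab on the left,
# and either a tab or end-of-string on the right.
my @hits = grep { /(?:\A|\t)abc\d+(?:\t|\z)/ } @lines;

print scalar(@hits), " of ", scalar(@lines), " lines matched\n";
# prints "3 of 4 lines matched"
```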
And let's test that out too. What we've done here is give the regular expression engine some alternatives. On the left-hand side, we're saying it's OK to match either a tab character or the beginning of the string. On the right-hand side, we're using the | (alternation) operator to tell Perl it's OK to match either another tab or the end of the string.

But are we done? I don't think so. What if "abc3" and "abc4" come right next to each other, separated only by a tab? The first match would consume that tab, leaving nothing for the second match to start on. We probably ought to convert the tab requirement on the right-hand side to a zero-width assertion:

/(?:\A|\t)abc\d+(?=\t|\z)/g

But then there are those pesky tab characters; they're getting captured, and you probably don't need them. You may start thinking, surely there must be an easy way to just get rid of them in the captured output. And you would be right. All you have to do is add capturing parentheses around the part of the pattern that you're actually interested in capturing. Here's how you would do it:
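The capturing version is missing from this copy of the post. A sketch of what it might look like, with a hypothetical input line, contrasting the tab-consuming pattern with the lookahead-plus-capture pattern (the array name @find follows the name used in the thread):

```perl
use strict;
use warnings;

# Hypothetical line where two targets are separated by a single tab.
my $line = "abc3\tabc4\tfoo\tabc10";

# Consuming the trailing tab: "abc4" is skipped because its leading tab
# was already eaten by the previous match, and tabs leak into the results.
my @consuming = $line =~ /(?:\A|\t)abc\d+(?:\t|\z)/g;

# Lookahead leaves the tab in place; the parentheses capture only abc<digits>.
my @find = $line =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;

print scalar(@consuming), " vs ", scalar(@find), "\n";    # prints "2 vs 3"
print "@find\n";                                          # prints "abc3 abc4 abc10"
```

In list context, m//g returns the captured groups when the pattern has any, and the whole matches otherwise, which is why the first array contains the tabs and the second does not.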
We'll go ahead and try that too. But your question stated you're only interested in counting how many times a match occurred. For that, you can just get a count of @find by evaluating it in scalar context. ...at least that's one way to do it.

perlrequick and perlretut are great places to start in learning to use regular expressions. perlre and perlrecharclass get you into more detail. And Jeffrey Friedl's book, "Mastering Regular Expressions", will attempt to teach you more than you ever thought possible about regular expressions (whether you actually learn it is up to you, but the book is the best possible resource available to help you to magnify your aptitude).

Update: Added /g as mentioned by Jenda. (It was included in the tests, but inadvertently left off of the code examples shown here.)

Dave
by Jenda (Abbot) on Jul 22, 2012 at 08:15 UTC
There should be a "g" at the end of those lines: otherwise the regexp matches only once instead of searching for all abc<some_number>. And if you happen to need to count each of the abc<some_number> separately you can either process the @find:
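The snippet is not preserved here. A minimal sketch of tallying an existing @find with a hash, where the array contents and the %count name are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical contents of @find after a global match.
my @find = qw(abc3 abc10 abc3 abc4 abc3);

# Tally each distinct abc<number> in a hash keyed by the matched text.
my %count;
$count{$_}++ for @find;

print "$_: $count{$_}\n" for sort keys %count;
# prints:
# abc10: 1
# abc3: 3
# abc4: 1
```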
or skip building the array
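This snippet is missing too. One way to do it, assuming the capturing pattern from davido's reply and a hypothetical input line, is to run the match in scalar context inside a while loop so each iteration finds the next occurrence:

```perl
use strict;
use warnings;

# Hypothetical tab-separated input line.
my $line = "abc3\tabc10\tabc3\tabc4\tabc3";

# With /g in scalar context, each evaluation resumes where the last match
# ended, so the loop visits every occurrence without building an array.
my %count;
$count{$1}++ while $line =~ /(?:\A|\t)(abc\d+)(?=\t|\z)/g;

print "$_: $count{$_}\n" for sort keys %count;
```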
Jenda
by davido (Cardinal) on Jul 22, 2012 at 08:30 UTC
I fixed the /g modifiers. They were in my head, and in the tests I linked to, but failed to find their way through my fingers to the keyboard as I posted the node. ;) Thanks for the catch!

Dave
Re: problem with regex
by ww (Archbishop) on Jul 22, 2012 at 00:52 UTC
But I read your goal a bit differently -- based on your apparent failure to fully understand code tags (your last line needs them around abc10 to make it read as something other than a link). Since it is a link, I'm going to work on the assumption you really meant what you said, above, abc10.

... and davido's answer covers that... but if you want your regex to match a line which contains exactly abc10 and nothing else, then you have this additional option: See below.

On the other hand, if you'd like to match abc10, abc01 and abc11, you can use a character class (that's what a set of chars/digits inside square brackets is called), where the "^" is another way to say 'match at the start of the string' and the curly-wrapped "2" is a quantifier -- match a 1 or a 0 and do so exactly 2 times. (Note that such a class also accepts abc00.) This little script spits out:
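Neither the script nor its output is preserved in this copy. As a rough sketch, using the corrected patterns described in the update below (the $ inside the delimiters, and \Z on the character-class version) and a hypothetical list of candidate lines:

```perl
use strict;
use warnings;

# Hypothetical candidate lines, one value per line of input.
my @candidates = qw(abc10 abc01 abc11 abc00 abc2 abc100 xabc10);

# Exactly "abc10" and nothing else on the line.
my @exact = grep { /^abc10$/ } @candidates;

# "abc" followed by exactly two characters, each a 1 or a 0.
my @pairs = grep { /^abc[10]{2}\Z/ } @candidates;

print "exact: @exact\n";    # prints "exact: abc10"
print "pairs: @pairs\n";    # prints "pairs: abc10 abc01 abc11 abc00"
```

Both $ and \Z also allow a trailing newline before the end of the string, which is convenient when lines have not been chomped.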
Update: Two bad screwups in one post. Thanks to AnomalousMonk for msg'ing me about them. First code sample should have the EOL marker, $, inside the regex-terminating "/". My bad! Second snippet, line seven, should be \Z as it was in the code I actually tested... after posting the inaccurate version, i.e., if( $_ =~ /^abc[10]{2}\Z/ ). Apologies to any who tried to use the advice.
Re: problem with regex
by Kenosis (Priest) on Jul 22, 2012 at 01:06 UTC
davido did an excellent job detailing the step-by-step creation of a general regex that would count your abc\d+ pattern within a line. However, I get the impression that you'd like a specific count of the number of occurrences for abc1 .. abc10 within a line. davido indicated that your regex would only match abc followed by either a 1 or a 0, so it's not going to match all of abc10. Perhaps I'm over-simplifying the issue, but I'd likely use the following to count the number of times abc10 occurs within a line of text:
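The regex is not preserved in this copy. A sketch of one way to count abc10 with word boundaries, using a hypothetical line and the empty-list assignment idiom to force list context:

```perl
use strict;
use warnings;

# Hypothetical line; "abc100" and "xabc10" should not be counted.
my $line = "abc1 abc10 abc100 abc10 xabc10 abc10";

# Assigning the match list to () and then to a scalar yields the
# number of matches without keeping the matches themselves.
my $count = () = $line =~ /\babc10\b/g;

print "abc10 occurs $count times\n";    # prints "abc10 occurs 3 times"
```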
This regex will match your abc10 surrounded by a word boundary. Consider the following test for this regex:
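The test script is missing here as well. A rough reconstruction of the kind of test described in this reply, 25 random picks from abc1 .. abc10 followed by a per-pattern count, might look like the following; the variable names and the space separator are assumptions:

```perl
use strict;
use warnings;

# The ten patterns of interest: abc1 .. abc10.
my @patterns = map { "abc$_" } 1 .. 10;

# Build a string of 25 randomly chosen patterns, separated by spaces.
my $string = join ' ', map { $patterns[ rand @patterns ] } 1 .. 25;

# Count each pattern with word boundaries so abc1 does not match inside abc10.
my %count;
for my $pattern (@patterns) {
    $count{$pattern} = () = $string =~ /\b\Q$pattern\E\b/g;
    print "$pattern: $count{$pattern}\n";
}
```

Because the input is random, the individual counts differ from run to run, but they always add up to 25.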
Output:
The script first generates a string of 25 random instances of the patterns (abc1 .. abc10) you're interested in counting. It ends by looping through each pattern, using a regex like the one above to provide a count of each occurrence of abc1 .. abc10 within that string. Hope this helps!