htmanning has asked for the wisdom of the Perl Monks concerning the following question:
It works, but it also tags numbers within a phone number such as 555-555-5555. How can I make it work only if there is a space before the 4 digits? That would preclude it from being recognized in a string of numbers such as dates and phone numbers. Thanks.

    my $digits_4 = qr{ \b \d{4} \b }xms;
    $text =~ s{ ($digits_4) }
              {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;

    my $digits_3 = qr{ \b \d{3} \b }xms;
    $text =~ s{ ($digits_3) }
              {<a href="resident-info.pl?do_what=view&unit=$1"><b>$1</b></a>}xmsg;
Re: Recognizing 3 and 4 digit number
by kcott (Archbishop) on Jan 02, 2017 at 08:06 UTC | |
G'day htmanning, Rather than drip-feeding us additional requirement changes, it would be much better if you started with something like this:
All of those tests were successful (output in spoiler):
This helps both you and us. You can add examples of representative input and the wanted output. There's a clear indication of the test data used along with expected and actual results. You can add new tests if necessary; tweak the regex if required; and ensure previous tests still pass. If you run into difficulties, we have all the information we need to provide immediate help. You get a faster, useful response and we don't have the frustration of an ever-changing specification.

As I said above, all of those tests were successful. If my test data is fully representative of your data, and my expectations match yours, then you may have a solution. However, if you have other use cases (the more likely scenario), modify the code above, change the regex if need be, and get back to us if you have further problems.

Here are some notes on your code and what I did differently.
See also: "perlre: Lookaround Assertions" and "perlrecharclass: Bracketed Character Classes".

-- Ken
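For readers who want a starting point, here is a minimal sketch of a Test::More script in the spirit described above. It is not the code from this reply: the pattern `$unit_re`, the helpers `tag` and `link_units`, and the test data are all assumptions made for illustration, and the rule "3 or 4 digits not touching a digit, `-` or `/`" is only one guess at the requirement.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More;

# Hypothetical pattern: 3 or 4 digits that do not touch another digit, '-' or '/'.
my $unit_re = qr{(?<![\d/-])(\d{3,4})(?![\d/-])};

# Build the link the OP wants around a unit number.
sub tag {
    my ($n) = @_;
    return qq{<a href="resident-info.pl?do_what=view&unit=$n"><b>$n</b></a>};
}

# Tag every unit number found in the text and return the new text.
sub link_units {
    my ($text) = @_;
    $text =~ s/$unit_re/tag($1)/ge;
    return $text;
}

# Each entry is [ input, expected output ]; add new cases here as requirements change.
my @tests = (
    [ 'Unit 1234 is vacant',   'Unit ' . tag(1234) . ' is vacant' ],
    [ '123',                   tag(123) ],
    [ 'Units 101, 2045.',      'Units ' . tag(101) . ', ' . tag(2045) . '.' ],
    [ 'Call 555-555-5555 now', 'Call 555-555-5555 now' ],
    [ 'Due 01/02/2017',        'Due 01/02/2017' ],
    [ 'Dated 2017-01-02',      'Dated 2017-01-02' ],
    [ 'Part 12345',            'Part 12345' ],
    [ 'Only 12',               'Only 12' ],
);

for my $t (@tests) {
    my ($in, $want) = @$t;
    is( link_units($in), $want, "input: $in" );
}

done_testing;
```

Keeping the data in one array means a new use case is a one-line addition, and every earlier case is rechecked each time the regex changes.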
|
Re: Recognizing 3 and 4 digit number
by BrowserUk (Patriarch) on Jan 02, 2017 at 01:10 UTC | |
I can't help but think there is more to this requirement than you've specified, but based on what you've asked for, plus a little bit more, try:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice.
by htmanning (Friar) on Jan 02, 2017 at 01:20 UTC | |
by 1nickt (Canon) on Jan 02, 2017 at 02:12 UTC | |
If you don't want the space in the substitution, don't include it in the capture group!
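A small sketch of that point, with made-up sample text and a plain `<b>` wrapper standing in for the OP's link: either match the space outside the capture group and put it back by hand, or assert it with a lookbehind so it is never consumed.

```perl
use strict;
use warnings;

my $text = 'Unit 1234 and unit 567 are vacant; ref555 is not a unit.';

# Option 1: match the space, capture only the digits, re-emit the space by hand.
(my $v1 = $text) =~ s{ (\d{3,4})\b}{ <b>$1</b>}g;

# Option 2: a lookbehind asserts the space without making it part of the match.
(my $v2 = $text) =~ s{(?<=\s)(\d{3,4})\b}{<b>$1</b>}g;

print "$v1\n$v2\n";
```

Both lines print `Unit <b>1234</b> and unit <b>567</b> are vacant; ref555 is not a unit.` Note that neither form matches a number at the very start of the string, and the first group of a phone number that follows a space would still match, because `\b` is satisfied before a hyphen.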
The way forward always starts with a minimal test.
by Anonymous Monk on Jan 02, 2017 at 02:05 UTC | |
|
Re: Recognizing 3 and 4 digit number
by davido (Cardinal) on Jan 02, 2017 at 16:44 UTC | |
Just use negative look-arounds to ensure that you do not have a digit on either side of a 3 or 4 digit number:
The output:
An advantage of using negative lookarounds is that you don't have to explicitly accommodate conditions such as the start or end of string or line. The negative lookarounds are just saying "a digit cannot come immediately before or after a sequence of 3 or 4 digits". With positive lookarounds you would have to say "either a non-digit or end of string must come before and after a sequence of 3 or 4 digits." That would look something like this (untested):

    m/(?:^|(?<=\D))(\d{3,4})(?=$|\D)/mg

So rather than asserting what must come before and after the digits, the regexp becomes simpler if we just assert what cannot come before or after.

Dave
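A short sketch of the negative-lookaround idea, using sample strings invented for illustration:

```perl
use strict;
use warnings;

# 3 or 4 digits with no digit immediately before or after.
my $re = qr/(?<!\d)(\d{3,4})(?!\d)/;

my %found;
for my $s ('123', 'unit 4567.', '12', '12345', '555-555-5555') {
    my @hits = $s =~ /$re/g;          # list context with /g returns every capture
    $found{$s} = "@hits";
    printf "%-14s => %s\n", $s, @hits ? "@hits" : '(none)';
}
```

The first two strings yield `123` and `4567`, while `12` and `12345` yield nothing. The last string yields `555 555 5555`: a hyphen is not a digit, so each group of a phone number still qualifies under this rule.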
by kcott (Archbishop) on Jan 03, 2017 at 04:46 UTC | |
G'day Dave, At first glance, I thought your regex was better than mine and so I decided to try it. I plugged it into my code but it failed on the phone number and date tests (details in spoiler). The OP requirements are not the best but excluding phone numbers and dates seems to be definitely wanted.

-- Ken
by davido (Cardinal) on Jan 03, 2017 at 17:53 UTC | |
Wah! I guess I got excited and answered before noticing that we wanted to disqualify things that look like phone numbers. Sorry. This isn't tested:
But it would run afoul of phone numbers using commas to separate, or wrapping parens around area codes. It might be useful to take a first pass and keep a list of offsets for "numbers" that should be ignored. It's probably easier to match a phone number with existing libraries than to match a 3 or 4 digit number that is not part of a phone number.

In other words, on first pass, identify phone numbers, IP addresses, and other problematic numbers, and push their offsets and lengths into an array. Then on second pass disqualify any number that falls within one of the offset/length sets.

Dave
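The two-pass idea could be sketched roughly as follows. Everything here is an assumption for illustration: the three "skip" patterns are deliberately loose stand-ins (a real phone-number library would replace the first), `link_units` is a hypothetical name, and a plain `<b>` wrapper stands in for the OP's link. Spans are stored as start and end offsets rather than offset and length.

```perl
use strict;
use warnings;

# Hypothetical, loose patterns for "numbers to leave alone".
my @skip_res = (
    qr{\(?\d{3}\)?[-.\s]\d{3}[-.]\d{4}},    # US-style phone numbers
    qr{\d{1,4}[/-]\d{1,2}[/-]\d{1,4}},      # dates such as 01/02/2017
    qr{\d{1,3}(?:\.\d{1,3}){3}},            # dotted-quad IP addresses
);

sub link_units {
    my ($text) = @_;

    # Pass 1: record the [start, end) offsets of every span to ignore.
    my @skip;
    for my $re (@skip_res) {
        while ($text =~ /$re/g) {
            push @skip, [ $-[0], $+[0] ];
        }
    }

    # Pass 2: tag 3- or 4-digit numbers unless they overlap a recorded span.
    $text =~ s{(?<!\d)(\d{3,4})(?!\d)}{
        my ($s, $e, $n) = ($-[0], $+[0], $1);
        (grep { $s < $_->[1] && $e > $_->[0] } @skip) ? $n : "<b>$n</b>";
    }ge;

    return $text;
}

my $in  = 'Unit 1234: call 555-555-5555 by 01/02/2017, host 192.168.100.1, or see 567.';
my $out = link_units($in);
print "$out\n";
```

With this input only `1234` and `567` are wrapped; the digits inside the phone number, the date, and the IP address are left alone because they fall inside a span recorded in the first pass.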
|
Re: Recognizing 3 and 4 digit number
by AnomalousMonk (Archbishop) on Jan 02, 2017 at 18:49 UTC | |
Update: On second thought, this post is really more like a reply to kcott's Re: Recognizing 3 and 4 digit number and probably should have been posted as such originally. Oh, well...

htmanning: My remarks are further to the careful and detailed remarks of kcott here and, I hope, are in the same spirit. I certainly agree with the recommendation (and its rationale) of doing development and posing questions to your fellow monks in a Test::More framework.

I tend to differ with kcott in the area of regex best practice. All the following are certainly personal best practices in this area, and are based largely on the regex recommendations in Perl Best Practices (PBP) by TheDamian.

kcott implies that one should avoid using the /x /m /s modifiers where they are not necessary. I think they are (almost) always necessary: They clarify intent and make it easier to think about what a regex, that most slippery and counterintuitive of things, is doing. When dealing with regexes, the less you have to think about the better. The result is that almost without exception, every qr// m// s/// operator I write ends up with an /xms tail.
The /x modifier allows comments, saviours of sanity, in regexes. kcott suggests the embedded
What if you want literal space characters in your regex when using the /x modifier? I prefer the [ ] usage over the \ usage (which is hard to see and has to be explained: that's

The justification for always using the /m /s modifiers is a bit different: They reduce the "degrees of freedom" of regex behavior. What does . (dot) match? "Dot matches everything except a newline except where modified by the /s modifier, in which case it matches everything." That's too much to think about. "Dot matches all" is a lot simpler, and that's what you get with the /s modifier, even if you never use a dot operator.

What if you actually want to match "everything but a newline"? Use [^\n] in that case; it does the job and perfectly conveys your intention. I have sometimes seen (?-s:.) and (?s:.) used to invoke the different behaviors of dot. Don't. It's just more potential brain-hurt.
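A small illustration of the dot and literal-space points, using a two-line sample string made up for the purpose:

```perl
use strict;
use warnings;

my $text = "unit 101\nunit 202";

my ($no_s)   = $text =~ m{ \A (.+) }x;            # without /s, dot stops at the newline
my ($with_s) = $text =~ m{ \A (.+) }xms;          # with /s, dot matches the newline too
my ($line)   = $text =~ m{ \A ([^\n]+) }xms;      # "anything but a newline", said explicitly
my ($num)    = $text =~ m{ unit [ ] (\d+) }xms;   # [ ] is a literal space under /x

print "without /s: $no_s\n";
print "with /s:    $with_s\n";
print "[^\\n]:      $line\n";
print "[ ]:        $num\n";
```

Here `$no_s` and `$line` both hold `unit 101`, `$with_s` holds the whole two-line string, and `$num` holds `101`.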
Similarly, the behaviors of the ^ $ operators are

With regard to the use of capture groups in qr// operators: This is something else I try assiduously to avoid.

Say you have two Regexp objects $rx $ry with an embedded capture group in each. They might be used in a substitution:

Give a man a fish: <%-{-{-{-<
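One plausible reason for keeping captures out of qr// objects is that a group's number then depends on where the object is interpolated. The sketch below uses invented names and data and is only a guess at the kind of example intended:

```perl
use strict;
use warnings;

my $s = 'x 123';

# With the capture inside the qr// object, its number depends on context.
my $rx = qr{ (\d{3}) }xms;
my ($alone)          = $s =~ m{ $rx }xms;             # the group is $1 here
my (undef, $shifted) = $s =~ m{ (x) \s+ $rx }xms;     # the same group is now $2

# With capture-free objects, the numbering is visible at the point of use.
my $digits = qr{ \d{3} }xms;
my $letter = qr{ [a-z] }xms;
(my $swapped = $s) =~ s{ ($letter) \s+ ($digits) }{$2 $1}xms;

print "alone=$alone shifted=$shifted swapped=$swapped\n";
```

This prints `alone=123 shifted=123 swapped=123 x`. In the second style the parentheses that define `$1` and `$2` sit in the same statement that uses them.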
by kcott (Archbishop) on Jan 03, 2017 at 03:49 UTC | |
G'day AnomalousMonk,

[Your Update just appeared as I hit [reply]. I think your post is fine where it is: htmanning gets a notification of your response with an alternative point of view and you had sent me a /msg anyway, so I was aware of it (thanks for that).]

"I tend to differ with kcott in the area of regex best practice."

While we certainly differ in some areas, I don't think the gulf is as wide as you suggest. I had originally intended to mention PBP in my post: I had a very long (over an hour) interruption in the middle of typing it and, when I finally returned to it, forgot to include the PBP part. My response below covers the points I wanted to make.

I was very impressed with PBP when I first read it over a decade ago (in fact, I read it cover-to-cover twice) and started using most (if not all) of its recommendations in my code. I suspect that, 10 years ago, our views on "regex best practice" may have been perfectly aligned. I still use much of PBP; although, these days, it's just become part of my standard practices and I don't really think of it in terms of following those specific recommendations. One area where I have departed from it is adding /msx to the end of every regex.

"kcott implies that one should avoid using the /x /m /s modifiers where they are not necessary. I think they are (almost) always necessary: ..."

I wasn't trying to imply anything as strong as "should avoid"; rather, my comments were intended to convey something closer to "could avoid".

Many organisations have Perl coding standards based on PBP. These are often quite inflexible: "You must write your matches like this: m{...}msx!". On the odd occasion that I've been faced with this, especially for short-term contracts, I just take the pragmatic approach and do it. Unfortunately, many of the programmers have no idea why they're doing this: I consider this to be a real problem. So, use all of those modifiers if your pay packet relies on it, but understand what they do and which are really required for the code being written.

I think we're pretty much on the same page with /x, so I'll say no more about that.

We definitely seem to be at odds with /m and /s. Perhaps it's a function of the type of data we normally process but I rarely need those: sometimes I need one of them; I need both far less often. There's not a lot more I can say about that: "(almost) always necessary" is not my experience.

Using the qr{(?mods:...)} form over the qr{...}mods form is something of a personal preference. I've only been using it for a year or two. The latter form makes the modifiers global: you can't get finer control such as qr{(?mo:...)(?ds:...)} or qr{(?mo:...(?ds:...)...)}. Having said that, my requirements for such fine control are exceptionally limited. I really have no strong feelings regarding which form people choose to use.

I don't think your arguments against using qr{(?mods:...)} because of potential typos are particularly compelling: I'm far more likely to not release the Shift key quickly enough and terminate a statement with a colon (and that can be a much harder bug to track down).

Whether or not it's a good idea to include captures in qr// is a matter of context: hardly something to be "assiduously" avoided. Where it's used like I did (s/$re/.../), there's no problem. The issue with the OP code was capturing the entire match (s/($re)/.../) when only part of the match was wanted in $1.

-- Ken
by AnomalousMonk (Archbishop) on Jan 03, 2017 at 19:33 UTC | |
"We definitely seem to be at odds with /m and /s. ... I rarely need those: sometimes I need one of them; I need both far less often."

My motive for always using the /ms modifier cluster (in addition to /x, of course) is to foster clarity, and clarity is always a necessity :) Clarity is improved because the . ^ $ operators have unvarying behaviors. Sometimes one is forced to be devious and must sacrifice clarity of expression, but that's what comments are for!

"... the qr{(?mods:...)} form over the qr{...}mods form ... The latter form makes the modifiers global: you can't get finer control such as qr{(?mo:...)(?ds:...)} or qr{(?mo:...(?ds:...)...)}."

The docs say this finer control is possible: (?mo-ds) and (?mo-ds:pattern) are rigorously scoped: (Tricky to put together a meaningful example for this!)

That said, I would never write regex A as above, but rather as: Don't mess with dot (or ^ $ either): much less potential for brain-hurt.

Update: Another version of regex A: In the context of global dot-matches-newline behavior, successive (?-s) and (?s) turn newline matching off and on, respectively. Again, I wouldn't actually write a regex this way unless my feet were being held to the fire.

Give a man a fish: <%-{-{-{-<
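A brief sketch of how inline modifiers are scoped, using the /s flag and a made-up two-character-plus-newline string. This is not the "regex A" from the post, which was not preserved; it only shows the scoping rules documented in perlre.

```perl
use strict;
use warnings;

my $s = "a\nb";

# (?s:...) turns dot-matches-newline on only inside that group.
my $scoped_on  = $s =~ m{ a (?s: . ) b }x         ? 1 : 0;

# (?-s:...) turns it off inside the group even though /s is set globally.
my $scoped_off = $s =~ m{ a (?-s: . ) b }xms      ? 1 : 0;

# A bare (?-s) applies from that point to the end of the enclosing group...
my $bare_off   = $s =~ m{ a (?: (?-s) . ) b }xms  ? 1 : 0;

# ...so a dot outside that group is governed by the global /s again.
my $outside    = $s =~ m{ (?: (?-s) a ) . b }xms  ? 1 : 0;

print "scoped_on=$scoped_on scoped_off=$scoped_off ",
      "bare_off=$bare_off outside=$outside\n";
```

This prints `scoped_on=1 scoped_off=0 bare_off=0 outside=1`.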
by kcott (Archbishop) on Jan 04, 2017 at 00:36 UTC | |
by AnomalousMonk (Archbishop) on Jan 08, 2017 at 23:16 UTC | |
| |
|
Re: Recognizing 3 and 4 digit number
by tybalt89 (Monsignor) on Jan 02, 2017 at 02:06 UTC | |
by htmanning (Friar) on Jan 02, 2017 at 02:20 UTC | |
by tybalt89 (Monsignor) on Jan 02, 2017 at 02:54 UTC | |
untested...
by htmanning (Friar) on Jan 02, 2017 at 03:05 UTC | |
by tybalt89 (Monsignor) on Jan 02, 2017 at 03:40 UTC | |