Recognizing 3 and 4 digit number

htmanning has asked for the wisdom of the Perl Monks concerning the following question:

Re: Recognizing 3 and 4 digit number by kcott (Archbishop) on Jan 02, 2017 at 08:06 UTC
G'day htmanning, Rather than drip-feeding us additional requirement changes, it would be much better if you started with something like this: #!/usr/bin/env perl -l use strict; use warnings; use Test::More; my @tests = ( ['12', '12'], ['123', '>123<>123<'], ['1234', '>1234<>1234<'], ['12345', '12345'], ['123 4567 890', '>123<>123< >4567<>4567< >890<>890<'], ['123 4567 89', '>123<>123< >4567<>4567< 89'], ['123-4567-890', '123-4567-890'], ['01/02/2017', '01/02/2017'], ['2017-01-02T17:01:34', '2017-01-02T17:01:34'], ["12\n345\n6789\n0", "12\n>345<>345<\n>6789<>6789<\n0"], ); plan tests => scalar @tests; my $re = qr{(?x: (?<![/-]) \b ( [0-9]{3,4} ) \b (?![/-]) )}; for my $test (@tests) { my ($string, $exp) = @$test; (my $got = $string) =~ s/$re/>$1<>$1</g; is($got, $exp, "Testing: $string"); } [download] All of those tests were successful (output in spoiler): `1..10 ok 1 - Testing: 12 ok 2 - Testing: 123 ok 3 - Testing: 1234 ok 4 - Testing: 12345 ok 5 - Testing: 123 4567 890 ok 6 - Testing: 123 4567 89 ok 7 - Testing: 123-4567-890 ok 8 - Testing: 01/02/2017 ok 9 - Testing: 2017-01-02T17:01:34 ok 10 - Testing: 12 # 345 # 6789 # 0` [download] This helps both you and us. You can add examples of representative input and the wanted output. There's a clear indication of the test data used along with expected and actual results. You can add new tests if necessary; tweak the regex if required; and ensure previous tests still pass. If you run into difficulties, we have all the information we need to provide immediate help. You get a faster, useful response and we don't have the frustration of an ever changing specification. As I said above, all of those tests were successful. If my test data is fully representative of your data, and my expectations match yours, then you may have a solution. However, if you have other use cases (the more likely scenario), modify the code above, change the regex if need be, and get back to us if you have further problems. 
Here are some notes on your code and what I did differently.

Modifiers

You've used a lot of modifiers, most in three places, and most are unnecessary.

`x`: you can specify this once, as I did, with `qr{(?x: ... )}`. You could have done the same with `m` & `s` if they were needed (see the next two points).

`m`: you haven't used any assertions regarding the start/end of line/string, so this one is unnecessary. My last test shows this: it has four lines and substitutions occur correctly on lines 2 and 3.

`s`: you haven't used a '`.`' in the regex; this modifier allows '`.`' to (also) match newlines, so this one is unnecessary.

`g`: this one is fine (although see Source Data below regarding using it twice).

See also: "perlre: Modifiers".

Captures

Instead of wrapping your regex in a capture as part of the substitution, add it to the regex when created, cf. `qr{... ( [0-9]{3,4} ) ...}` in my code. This would have removed the problem discussed elsewhere in this thread.

Source Data

You probably don't want two lots of substitutions on the same string (`$text`). In my code, `[0-9]{3,4}` handles all the use cases; of course, you may have other use cases.

See also: "perlre: Lookaround Assertions" and "perlrecharclass: Bracketed Character Classes".

-- Ken
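The point about `m` and `s` can be checked directly. The following is a minimal sketch, not code from the thread: `$with_ms` and `mark()` are hypothetical names introduced only for this comparison, and the replacement is simplified to a single `>$1<`. It compiles the pattern from the post above once without and once with `/m` and `/s`, then compares the results on multi-line input.

```perl
use strict;
use warnings;

# The pattern from the post above, with /x embedded via (?x: ... ).
my $plain = qr{(?x: (?<![/-]) \b ( [0-9]{3,4} ) \b (?![/-]) )};

# Hypothetical variant for comparison only: same pattern, /m and /s added.
my $with_ms = qr{ (?<![/-]) \b ( [0-9]{3,4} ) \b (?![/-]) }xms;

# mark(): wrap every matched number in >...< and return the new string.
sub mark {
    my ($re, $string) = @_;
    (my $copy = $string) =~ s/$re/>$1</g;
    return $copy;
}

my $text = "12\n345\n6789\n0 and 01/02/2017";

# The pattern contains no ^, $ or dot, so /m and /s have nothing to change.
print mark($plain, $text) eq mark($with_ms, $text)
    ? "same result\n"
    : "different result\n";
```

Because the pattern uses neither anchors nor a dot, both compiled forms should behave identically here; the extra modifiers only matter once `^`, `$` or `.` appear.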
Re: Recognizing 3 and 4 digit number by BrowserUk (Patriarch) on Jan 02, 2017 at 01:10 UTC
I can't help but think there is more to this requirement than you've specified, but based on what you've asked for, plus a little bit more, try: `/\s\d{3,4}\s/ and print for 'abd 123 fred', '555-5555-6666', 'ab 12345 xd';;` which prints `abd 123 fred`.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity. In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: Recognizing 3 and 4 digit number by htmanning (Friar) on Jan 02, 2017 at 01:20 UTC
I put a `\s` in front of the `\d` and it works, but it puts a `%20` in the URL. It still works, but there must be another way. Thank you.
Re^3: Recognizing 3 and 4 digit number by 1nickt (Canon) on Jan 02, 2017 at 02:12 UTC
If you don't want the space in the substitution, don't include it in the capture group! `$ perl -E' my $re = qr/ \s ( \d{3,4} ) /x; say ">$1<" if " 5678" =~ /$re/; '` prints `>5678<`. The way forward always starts with a minimal test.
Re^3: Recognizing 3 and 4 digit number by Anonymous Monk on Jan 02, 2017 at 02:05 UTC
Impossible :) Nothing in that snippet does "value" encoding/escaping.
Re: Recognizing 3 and 4 digit number by davido (Cardinal) on Jan 02, 2017 at 16:44 UTC
Just use negative look-arounds to ensure that you do not have a digit on either side of a 3 or 4 digit number: `my @strings = ( 'foo1234bar', '1234 5678', 'abcd9012f123ab', '123', ' 123', '123 ', ); foreach my $string (@strings) { my(@nums) = $string =~ m/(?<!\d)(\d{3,4})(?!\d)/g; local $" = ','; print "<$string>: (@nums)\n"; }`

The output: `<foo1234bar>: (1234) <1234 5678>: (1234,5678) <abcd9012f123ab>: (9012,123) <123>: (123) < 123>: (123) <123 >: (123)`

An advantage of using negative lookarounds is that you don't have to explicitly accommodate conditions such as the start or end of string or line. The negative lookarounds are just saying "a digit cannot come immediately before or after a sequence of 3 or 4 digits". With positive lookarounds you would have to say "either a non-digit or start/end of string must come before and after a sequence of 3 or 4 digits." That would look something like this (untested): `m/(?:^|(?<=\D))(\d{3,4})(?=$|\D)/mg`

So rather than asserting what must come before and after the digits, the regexp becomes simpler if we just assert what cannot come before or after.

Dave
Re^2: Recognizing 3 and 4 digit number by kcott (Archbishop) on Jan 03, 2017 at 04:46 UTC
G'day Dave, At first glance, I thought your regex was better than mine and so I decided to try it. I plugged it into my code but it failed on the phone number and date tests (details in spoiler). The OP requirements are not the best, but excluding phone numbers and dates definitely seems to be wanted.

-- Ken
Re^3: Recognizing 3 and 4 digit number by davido (Cardinal) on Jan 03, 2017 at 17:53 UTC
Wah! I guess I got excited and answered before noticing that we wanted to disqualify things that look like phone numbers. Sorry. This isn't tested: `m/(?<![\d-])(\d{3,4})(?![\d-])/`

But it would run afoul of phone numbers using commas to separate, or wrapping parens around area codes. It might be useful to take a first pass and keep a list of offsets for "numbers" that should be ignored. It's probably easier to match a phone number with existing libraries than to match a 3 or 4 digit number that is not part of a phone number. In other words, on first pass, identify phone numbers, IP addresses, and other problematic numbers, and push their offsets and lengths into an array. Then on second pass disqualify any number that falls within one of the offset/length sets.

Dave
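The two-pass idea can be sketched roughly as follows. This is an illustration under stated assumptions, not davido's code: the `$excluded` pattern is a hypothetical and deliberately crude stand-in for "phone numbers, dates, IP addresses" (digit groups joined by `-`, `/` or `.`), and `standalone_numbers()` is a name invented here. A real application would more likely recognize phone numbers with a dedicated library, as the post suggests.

```perl
use strict;
use warnings;

# Hypothetical, crude pattern for "problematic" numbers: two or more
# digit groups joined by -, / or . (e.g. 555-1234, 01/02/2017, 10.0.0.1).
my $excluded = qr{ \d+ (?: [-/.] \d+ )+ }x;

sub standalone_numbers {
    my ($text) = @_;

    # Pass 1: record [offset, length] of every span that must be ignored.
    my @skip;
    while ($text =~ /$excluded/g) {
        push @skip, [ $-[0], $+[0] - $-[0] ];
    }

    # Pass 2: collect 3- or 4-digit numbers lying outside every span.
    my @found;
    while ($text =~ /(?<!\d)(\d{3,4})(?!\d)/g) {
        my ($start, $end) = ($-[1], $+[1]);
        next if grep { $start >= $_->[0] && $end <= $_->[0] + $_->[1] } @skip;
        push @found, $1;
    }
    return @found;
}

my $sample = 'Call 555-1234 or see room 4567 on 01/02/2017, code 890';
print join(',', standalone_numbers($sample)), "\n";
```

The offsets come from the built-in `@-` and `@+` arrays, so no extra modules are needed. The quality of the result depends entirely on how well the first-pass pattern recognizes the things to exclude.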
Re: Recognizing 3 and 4 digit number by AnomalousMonk (Archbishop) on Jan 02, 2017 at 18:49 UTC
Update: On second thought, this post is really more like a reply to kcott's Re: Recognizing 3 and 4 digit number and probably should have been posted as such originally. Oh, well...

htmanning: My remarks are further to the careful and detailed remarks of kcott here and, I hope, are in the same spirit. I certainly agree with the recommendation (and its rationale) of doing development and posing questions to your fellow monks in a Test::More framework.

I tend to differ with kcott in the area of regex best practice. All the following are certainly personal best practices in this area, and are based largely on the regex recommendations of Perl Best Practices (PBP) by TheDamian.

kcott implies that one should avoid using the `/x /m /s` modifiers where they are not necessary. I think they are (almost) always necessary: They clarify intent and make it easier to think about what a regex, that most slippery and counterintuitive of things, is doing. When dealing with regexes, the less you have to think about the better. The result is that almost without exception, every `qr// m// s///` operator I write ends up with an `/xms` tail.

The `/x` modifier allows comments, saviours of sanity, in regexes. kcott suggests the embedded `qr{(?x: pattern with whitespace )}` usage where comments are needed. This is undesirable IMHO for two reasons: there are two opportunities for inadvertent literal spaces, before and after the `(?x: ... )` expression, giving you, e.g., `qr{ (?x: pattern with whitespace ) }` and potential brain-hurt. The alternate form `qr{(?x) pattern with whitespace }` is better, but still leaves room for a leading literal space to creep in: `qr{ (?x) pattern with whitespace }` Oops. Just write `qr{ ... }xms` and be done with it.

What if you want literal space characters in your regex when using the `/x` modifier? I prefer the `[ ]` usage over the `\ ` usage (which is hard to see and has to be explained: that's an escaped literal space).
A string containing literal spaces can be represented as `qr{ \Qstring with some literal spaces\E }xms`

The justification for always using the `/m /s` modifiers is a bit different: They reduce the "degrees of freedom" of regex behavior. What does `.` (dot) match? "Dot matches everything except a newline except where modified by the `/s` modifier, in which case it matches everything." That's too much to think about. "Dot matches all" is a lot simpler, and that's what you get with the `/s` modifier, even if you never use a dot operator. What if you actually want to match "everything but a newline"? Use `[^\n]` in that case; it does the job and perfectly conveys your intention. I have sometimes seen `(?-s:.)` and `(?s:.)` used to invoke the different behaviors of dot. Don't. It's just more potential brain-hurt.

Similarly, the behaviors of the `^ $` operators are expanded by the `/m` modifier. What if you want only their commonly used end-of-string behaviors? The `\A \z \Z` operators were invented for this purpose.

With regard to the use of capture groups in `qr//` operators: This is something else I try assiduously to avoid. Say you have two `Regexp` objects `$rx $ry` with an embedded capture group in each. They might be used in a substitution: `$string =~ s{ foo $rx bar $ry baz }{$1$2}xmsg;` If you change the pattern match to `$string =~ s{ foo $ry bar $rx baz }{$1$2}xmsg;` do you also have to change the order of the capture variables `$1 $2` in the replacement string? The problem, of course, is that capture variables correspond in an absolute way to the order of capture groups in the `s///` match.
The question is highlighted more sharply if the captures appear explicitly in the `s///` match: `$string =~ s{ foo ($rx) bar ($ry) baz }{$1$2}xmsg;` to `$string =~ s{ foo ($ry) bar ($rx) baz }{$1$2}xmsg; # switch $1 $2 also?` The `\g`n relative back-reference extension of Perl release 5.10 eases the problem of capture group numbering somewhat, but capture group variables are still staunchly absolutist! (The `(?|alternation|pattern)` construct of 5.10 also eases the capture group numbering problem a bit.)

Give a man a fish: `<%-{-{-{-<`
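The numbering problem described above can be seen in a small experiment. This is a sketch under assumptions: the contents of `$rx` and `$ry` and the helper names `join_by_number()` and `join_by_name()` are invented for illustration. The second helper uses named captures (`(?<name>...)` and the `%+` hash, available since Perl 5.10), which nobody in the thread proposed; it is shown only as one possible way to make the replacement independent of group order.

```perl
use strict;
use warnings;

# Two hypothetical Regexp objects, each carrying one (named) capture group.
my $rx = qr{ (?<word>   [a-z]+ ) }xms;
my $ry = qr{ (?<digits> \d+    ) }xms;

# Replacement by number: $1 and $2 follow the order of groups in the match.
sub join_by_number {
    my ($string, $first, $second) = @_;
    (my $copy = $string) =~ s{ foo $first bar $second baz }{$1-$2}xmsg;
    return $copy;
}

# Replacement by name: independent of the order of $first and $second.
sub join_by_name {
    my ($string, $first, $second) = @_;
    (my $copy = $string) =~ s{ foo $first bar $second baz }{$+{word}-$+{digits}}xmsg;
    return $copy;
}

print join_by_number('fooabcbar123baz', $rx, $ry), "\n";  # word first
print join_by_number('foo123barabcbaz', $ry, $rx), "\n";  # digits first
print join_by_name('fooabcbar123baz',   $rx, $ry), "\n";
print join_by_name('foo123barabcbaz',   $ry, $rx), "\n";
```

With numbered variables, swapping the two objects in the pattern swaps what `$1` and `$2` hold, so the replacement has to be edited too. With names, the replacement stays the same. Whether that justifies putting captures inside `qr//` objects remains the judgment call debated in this thread.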
Re^2: Recognizing 3 and 4 digit number by kcott (Archbishop) on Jan 03, 2017 at 03:49 UTC
G'day AnomalousMonk, [Your Update just appeared as I hit [reply]. I think your post is fine where it is: htmanning gets a notification of your response with an alternative point of view and you had sent me a `/msg` anyway, so I was aware of it (thanks for that).] "I tend to differ with kcott in the area of regex best practice." While we certainly differ in some areas, I don't think the gulf is as wide as you suggest. I had originally intended to mention PBP in my post: I had a very long (over an hour) interruption in the middle of typing it and, when I finally returned to it, forgot to include the PBP part. My response below covers the points I wanted to make. I was very impressed with PBP when I first read it over a decade ago — in fact, I read it cover-to-cover twice — and started using most (if not all) of its recommendations in my code. I suspect that, 10 years ago, our views on "regex best practice" may have been perfectly aligned. I still use much of PBP; although, these days, it's just become part of my standard practices and I don't really think of it in terms of following those specific recommendations. One area that I have departed from is adding `/msx` to the end of every regex. "kcott implies that one should avoid using the `/x /m /s` modifiers where they are not necessary. I think they are (almost) always necessary: ..." I wasn't trying to imply anything as strong as "should avoid"; rather, my comments were intended to convey something closer to "could avoid". Many organisations have Perl coding standards based on PBP. These are often quite inflexible: "You must write your matches like this: `m{...}msx`!". On the odd occasion that I've been faced with this, especially for short-term contracts, I just take the pragmatic approach and do it. Unfortunately, many of the programmers have no idea why they're doing this: I consider this to be a real problem. 
So, use all of those modifiers if your pay packet relies on it, but understand what they do and which are really required for the code being written. I think we're pretty much on the same page with `/x`, so I'll say no more about that. We definitely seem to be at odds with `/m` and `/s`. Perhaps it's a function of the type of data we normally process but I rarely need those: sometimes I need one of them; I need both far less often. There's not a lot more I can say about that: "(almost) always necessary" is not my experience. Using the `qr{(?mods:...)}` form over the `qr{...}mods` form is something of a personal preference. I've only been using it for a year or two. The latter form makes the modifiers global: you can't get finer control such as `qr{(?mo:...)(?ds:...)}` or `qr{(?mo:...(?ds:...)...)}`. Having said that, my requirements for such fine control are exceptionally limited. I really have no strong feelings regarding which form people choose to use. I don't think your arguments against using `qr{(?mods:...)}` because of potential typos are particularly compelling: I'm far more likely to not release the Shift key quickly enough and terminate a statement with a colon (and that can be a much harder bug to track down). Whether or not it's a good idea to include captures in `qr//` is a matter of context: hardly something to be "assiduously" avoided. Where it's used like I did (`s/$re/.../`), there's no problem. The issue with the OP code was capturing the entire match (`s/($re)/.../`) when only part of the match was wanted in `$1`. — Ken	[reply] [d/l] [select]
Re^3: Recognizing 3 and 4 digit number by AnomalousMonk (Archbishop) on Jan 03, 2017 at 19:33 UTC
"We definitely seem to be at odds with `/m` and `/s`. ... I rarely need those: sometimes I need one of them; I need both far less often."

My motive for always using the `/ms` modifier cluster (in addition to `/x`, of course) is to foster clarity, and clarity is always a necessity :) Clarity is improved because the `. ^ $` operators have unvarying behaviors. Sometimes one is forced to be devious and must sacrifice clarity of expression, but that's what comments are for!

"... the `qr{(?mods:...)}` form over the `qr{...}mods` form ... The latter form makes the modifiers global: you can't get finer control such as `qr{(?mo:...)(?ds:...)}` or `qr{(?mo:...(?ds:...)...)}`."

The docs say this finer control is possible: `(?mo-ds)` and `(?mo-ds:pattern)` are rigorously scoped: `c:\@Work\Perl\monks>perl -wMstrict -le "my $s = qq{aa \n bb \n cc}; ;; print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]} if $s =~ m{ \A ((?-s: .+)) .+ ((?-s: .+)) .+ ((?-s: .+)) \z }xms; ;; print qq{B: match, \$1 '$1' @ $-[1]} if $s =~ m{ \A ((?-s: .+)) \z }xms; ;; print qq{C: match, \$1 '$1' @ $-[1]} if $s =~ m{ \A ((?-s: (?s: .+))) \z }xms; " A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11 C: match, $1 'aa bb cc' @ 0` (Tricky to put together a meaningful example for this!)

That said, I would never write regex A as above, but rather as: `c:\@Work\Perl\monks>perl -wMstrict -le "my $s = qq{aa \n bb \n cc}; ;; print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]} if $s =~ m{ \A ([^\n]+) .+ ([^\n]+) .+ ([^\n]+) \z }xms; " A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11` Don't mess with dot (or `^ $` either): much less potential for brain-hurt.
Update: Another version of regex A: `c:\@Work\Perl\monks>perl -wMstrict -le "my $s = qq{aa \n bb \n cc}; ;; print qq{A: match, \$1 '$1' @ $-[1] \$2 '$2' @ $-[2] \$3 '$3' @ $-[3]} if $s =~ m{ \A (?-s) (.+) (?s) .+ (?-s) (.+) (?s) .+ (?-s) (.+) \z }xms; " A: match, $1 'aa ' @ 0 $2 ' ' @ 9 $3 'c' @ 11` In the context of global dot-matches-newline behavior, successive `(?-s)` and `(?s)` turn newline matching off and on, respectively. Again, I wouldn't actually write a regex this way unless my feet were being held to the fire.

Give a man a fish: `<%-{-{-{-<`
Re^4: Recognizing 3 and 4 digit number by kcott (Archbishop) on Jan 04, 2017 at 00:36 UTC
Re^5: Recognizing 3 and 4 digit number by AnomalousMonk (Archbishop) on Jan 08, 2017 at 23:16 UTC
Re: Recognizing 3 and 4 digit number by tybalt89 (Monsignor) on Jan 02, 2017 at 02:06 UTC
`my $digits_4 = qr{ (?<=\ ) \d{4} \b }xms; my $digits_3 = qr{ (?<=\ ) \d{3} \b }xms;`
Re^2: Recognizing 3 and 4 digit number by htmanning (Friar) on Jan 02, 2017 at 02:20 UTC
Okay, this worked, BUT I just realized I cannot rely on a space to signal a valid number. Sometimes the number starts the text field, so there is no space. I'm trying to recognize only those numbers that aren't followed or preceded by a slash, dash, etc., that would indicate a phone number or date.
Re^3: Recognizing 3 and 4 digit number by tybalt89 (Monsignor) on Jan 02, 2017 at 02:54 UTC
`my $digits_4 = qr{ (?<![-\/\ \w]) \d{4} (?![-\/\ \w]) }xms;` untested...
Re^4: Recognizing 3 and 4 digit number by htmanning (Friar) on Jan 02, 2017 at 03:05 UTC
Re^5: Recognizing 3 and 4 digit number by tybalt89 (Monsignor) on Jan 02, 2017 at 03:40 UTC