Regex for extracting phone numbers from string

nysus has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Regex for extracting phone numbers from string by afoken (Chancellor) on Apr 11, 2016 at 06:02 UTC
"I'm trying to write a regex to extract phone numbers from a string. [...] As you can see, my regex does not handle examples #2, #5 and #9 properly. When there are long strings of digits, it finds phone numbers where there are none. How can I modify the regex to avoid these false positives? I've tried a few different ways but none of them succeeded."

Are you aware that people write phone numbers in very different ways around the world? Some examples:

In Germany, "(0 12 34) 5 67 89 - 11" is a very common way to write a business phone number; "0 12 34 / 5 67 89 - 11" is just as common. "(0 98 76) 54 32 10 98" may be a private phone number (no extension after "-"); it could also be written as "0 98 76 / 54 32 10 98". In a business context, "+49 (0) 12 34 / 5 67 89 - 11" is also used. International callers would dial their international call prefix, then 49 for Germany, then 12345678911 to reach the contact; German callers would ignore the "+49" and dial 012345678911.

Spacing varies wildly, ignoring the rule that digits should be written in pairs. People trying to stand out from the masses add dots, underscores or some other line noise to their phone number. Others just omit all whitespace: +49(0)1234/56789-11.

Speaking of international call prefixes (the "+" in "+49"): Germany and many other countries use 00, the USA uses 011, Australia uses 0011, Cuba uses 119, some former Soviet republics use 810, and so on. In GSM networks, you can directly dial "+".

Vanity numbers seem to be quite popular in some areas of the world. "555-SHOE-SHOP" may be a valid phone number in the US, "(0700) PERLMONK" may be a valid phone number in Germany. Because the masses still do not understand vanity numbers, the latter is usually also written out in digits: "(0700) 73756665".

000, 110, 112, 117, 118, 144, 911, 999 are unusually short, but still valid phone numbers, all for emergency use in various areas of the world. Not all emergency numbers work everywhere. Germany only uses 110 and 112. 117 and 118 are used in Switzerland, 911 in the USA, 999 and 112 in the UK, 000 in Australia, 144 in Switzerland and Austria. France also uses the two-digit numbers 15, 17, and 18 as emergency call numbers.

Don't assume three digits = emergency use. While most phone numbers in Germany have at least four, usually six to eight digits, my brother has an ancient three-digit phone number, recently migrated to VoIP. As long as those ancient numbers are in use, they cannot be revoked or changed. So some people still have and use those exceptionally short numbers in Germany.

See also Telephone_number, National_conventions_for_writing_telephone_numbers, Re: Trim Phone Numbers.

Alexander
-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
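To make the German variants above concrete, here is a minimal sketch that reduces each written form to the digits a domestic caller would dial. The helper name `german_national_digits` is hypothetical, and the sketch assumes only the cases quoted above: country code 49 written as "+49", and a single trunk "0" that may appear as "(0)" after the country code. It does not handle the "00" or other international call prefixes, vanity letters, or other countries.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn a written German number
# into the digit string a caller inside Germany would dial.
# Assumption: the only international form is "+49", optionally followed by "(0)".
sub german_national_digits {
    my ($written) = @_;
    my $n = $written;
    $n =~ s/\(0\)//;      # drop the "(0)" that may follow "+49"
    $n =~ s/[^\d+]//g;    # strip spaces, slashes, dashes, dots and parens
    $n =~ s/^\+49/0/;     # replace the country code with the trunk prefix 0
    return $n;
}

for my $written (
    '(0 12 34) 5 67 89 - 11',
    '0 12 34 / 5 67 89 - 11',
    '+49 (0) 12 34 / 5 67 89 - 11',
    '+49(0)1234/56789-11',
) {
    printf "%-30s -> %s\n", $written, german_national_digits($written);
}
```

All four spellings should reduce to the same string, 012345678911. This also shows why a single extraction regex is hard to write: the separators carry no information, so normalizing first and judging the digits afterwards is usually simpler.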
Re^2: Regex for extracting phone numbers from string by dsheroh (Monsignor) on Apr 11, 2016 at 07:32 UTC
As yet another variation, 30 years ago, I lived in a rural area of the US where the population was small enough that, to make calls within the local exchange, you only had to dial the last four digits of the number. I assume that sort of thing is still around. And there are obviously rules about which digits can be used and in what sequence, since that's the only way the phone switching system can determine when you've dialed the complete number, so not even every 7- or 10-digit number is a potential phone number. For example, 123-4567 can't be a valid phone number in the US because the leading '1' indicates that you're calling a non-local number.
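The digit rules mentioned above can be used as a cheap filter after extraction. The sketch below is an illustration, not something posted in the thread: the helper name `plausible_nanp` is hypothetical, and it assumes the North American Numbering Plan rule that both the area code and the exchange begin with a digit from 2 to 9. It does not cover four-digit local dialing or any finer-grained reservations.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the digits of a written number could form a
# 7- or 10-digit North American number.
# Assumption: area code and exchange must each start with 2-9.
sub plausible_nanp {
    my ($written) = @_;
    (my $digits = $written) =~ s/\D//g;    # keep digits only
    $digits =~ s/\A1(?=\d{10}\z)//;        # drop a leading long-distance 1
    return $digits =~ /\A(?:[2-9]\d{2})?[2-9]\d{6}\z/ ? 1 : 0;
}

for my $written ('123-4567', '562-3113', '1 (413) 555-2378', '413-155-2378') {
    printf "%-18s %s\n", $written,
        plausible_nanp($written) ? 'plausible' : 'rejected';
}
```

With this rule, 123-4567 and 413-155-2378 are rejected, while 562-3113 and 1 (413) 555-2378 pass.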
Re^2: Regex for extracting phone numbers from string by nysus (Parson) on Apr 11, 2016 at 13:08 UTC
I'm writing this script for my own use, to pull phone numbers from text that will only contain local seven-digit numbers or North American phone numbers.

$PM = "Perl Monk's"; $MCF = "Most Clueless ~~Friar~~ ~~Abbot~~ ~~Bishop~~ ~~Pontiff~~ ~~Deacon~~ Curate"; $nysus = $PM . ' ' . $MCF;
Re: Regex for extracting phone numbers from string by Discipulus (Canon) on Apr 11, 2016 at 07:10 UTC
While waiting for Regexp::Common to release its phone number regexes, you can spend your time reviewing: a-comprehensive-regex-for-phone-number-validation, Number::Phone, regular-expressions-cookbook. Good luck!

L*

There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Regex for extracting phone numbers from string by Corion (Patriarch) on Apr 11, 2016 at 07:32 UTC
Also see Beast of the Number: Parsing the Feral Phone, which contains various insights.
Re: Regex for extracting phone numbers from string by AnomalousMonk (Archbishop) on Apr 11, 2016 at 05:58 UTC
The discussions of Recognizing numbers and creating links and Pull 3-digit and 4-digit numbers from string may offer insight. They are concerned with avoiding phone numbers, but you have to recognize a phone number in order to avoid it, and that's what you're trying to do.

Give a man a fish: `<%-{-{-{-<`
Re: Regex for extracting phone numbers from string by GrandFather (Saint) on Apr 11, 2016 at 04:26 UTC
"As you can see, ..." No, I can't see. #2 and #9 are rejected and #5 is accepted. This is consistent with the comments in your code. Assuming I need it spelled out, what is the problem you see?

Premature optimization is the root of all job security
Re^2: Regex for extracting phone numbers from string by nysus (Parson) on Apr 11, 2016 at 13:04 UTC
#2 is not rejected. It shows as a match: 1-413-777-8888. #5 was a typo; I meant #6. #9 results in two different matches.
Re^3: Regex for extracting phone numbers from string by GrandFather (Saint) on Apr 11, 2016 at 21:05 UTC
Anchoring the ends of the match to white space fixes 2 and 6. It is harder to fix 9 without knowing what else may be in the string and thus how much "buffer" is needed. However, 9 could be fixed in a second test for the overall length of a normalized string. Indeed, the whole matching process becomes easier by using a first pass to extract candidate numbers, then a second pass to reject numbers that aren't the right length.

    use strict;
    use warnings;
    use 5.010;

    my $string = <<STR;
    Example 1: 413-577-1234
    Example 2: 981-413-777-8888
    Example 3: 413.233.2343
    Example 4: 562-3113
    Example 5: 401 311 7898
    Example 6: 2342343-23878-878-2343
    Example 7: 1 (413) 555-2378
    Example 8: 4135552378
    Example 9: 413 789 8798 343 9878
    STR

    extract($string);

    sub extract {
        my %results = ();
        my $string  = shift;

        # pad string with spaces to make it easier to find phone numbers
        # at beginning and end of strings
        $string = ' ' . $string . ' ';

        # get rid of consecutive whitespace characters to make regex
        # easier and faster
        $string =~ s/(\s){2,}/$1/g;

        # find patterns in the string that look like phone numbers
        my @matches = $string =~ /
            (Example\s\d+:\s)
            # Anchor the left end of the phone number
            (?<=\s)
            (
                # Look for ten digit North American numbers
                (?: 1(?:\.|\s|-))*     # optional 1 followed by period, whitespace or dash
                \(?                    # optional opening paren
                \d{3}                  # three consecutive digits
                (?:\.|\)|-|\s|\)\s)?   # optional punctuation (period, close paren, dash, whitespace)
                \d{3}                  # three consecutive digits
                (?:\.|-|\s)?           # optional punctuation
                \d{4}                  # 4 consecutive digits
            |
                # Look for seven digit North American numbers
                \d{3}                  # three consecutive digits
                (?:\.|-|\s)?           # optional punctuation
                \d{4}                  # 4 consecutive digits
            )
            # Anchor the right end of the phone number
            (?=\s)
            /gx;

        # matches come in pairs: the "Example N: " label, then the number
        say shift @matches, "match: '", shift @matches, "'" while @matches;
    }

Prints:

    Example 1: match: '413-577-1234'
    Example 3: match: '413.233.2343'
    Example 4: match: '562-3113'
    Example 5: match: '401 311 7898'
    Example 7: match: '1 (413) 555-2378'
    Example 8: match: '4135552378'
    Example 9: match: '413 789 8798'
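The two-pass idea described above (collect candidates first, then reject by normalized length) could look roughly like the following. This is a sketch under stated assumptions, not GrandFather's code: the sub name `extract_two_pass` is hypothetical, candidates are assumed to be delimited by whitespace or the ends of the string, and accepted lengths are assumed to be 7 digits, 10 digits, or 11 digits with a leading 1.

```perl
use strict;
use warnings;

# Hypothetical two-pass extractor (illustration only).
sub extract_two_pass {
    my ($text) = @_;

    # Pass 1: a candidate is a whitespace-delimited run that starts with a
    # digit (or an opening paren and a digit), ends with a digit, and holds
    # only digits, dots, dashes, parens, spaces and tabs in between.
    # Newlines are excluded so a candidate never spans two lines.
    my @candidates = $text =~ /(?<!\S)(\(?\d[\d.() \t-]*\d)(?!\S)/g;

    # Pass 2: normalize each candidate to digits and keep only plausible
    # lengths (assumption: 7, 10, or 11 with a leading 1).
    my @numbers;
    for my $candidate (@candidates) {
        (my $digits = $candidate) =~ s/\D//g;
        my $len = length $digits;
        push @numbers, $candidate
            if $len == 7 || $len == 10 || ($len == 11 && $digits =~ /\A1/);
    }
    return @numbers;
}

my $string = join "\n",
    'Example 1: 413-577-1234',
    'Example 2: 981-413-777-8888',
    'Example 4: 562-3113',
    'Example 6: 2342343-23878-878-2343',
    'Example 7: 1 (413) 555-2378',
    'Example 9: 413 789 8798 343 9878';

print "match: '$_'\n" for extract_two_pass($string);
```

Because the first pass is greedy, Example 9 is captured as one 17-digit candidate and then dropped by the length check, rather than yielding a partial match. The trade-off is that two genuine numbers separated only by a space on the same line would also be merged and rejected, which is the ambiguity noted above.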
Re: Regex for extracting phone numbers from string by Laurent_R (Canon) on Apr 11, 2016 at 06:36 UTC
There are zillions of ways to recognize a telephone number, but none of them is universal. It all depends on whether you are willing to accept false positives or false negatives, and on how much you know about the various formatting possibilities. One possible try is something like this: `$num = $1 if $string =~ /(\d[\d.()\s-]{5,18})/;` i.e. a leading digit followed by 5 to 18 digits, periods, spaces, parens, or dashes.
Re: Regex for extracting phone numbers from string by nysus (Parson) on Apr 11, 2016 at 14:43 UTC
OK, after sleeping on it I decided to experiment more with positive look-ahead and look-behind assertions. They seem to have done the trick. Here is the new code with new examples:

    use strict;
    use warnings;
    use 5.010;

    my $string = "Example 1: 413-577-1234
    Example 2: 981-413-777-8888
    Example 3: 413.233.2343
    Example 4: 562-3113
    Example 5: 401 311 7898
    Example 6: 55555-55555-555-5555
    Example 7: 1 (413) 555-2378
    Example 7: 1(413)666-2378
    Example 8: 4135552378
    Example 9: 413 789 8798 343 9878
    Example 10: 22222222222222222222";

    extract($string);

    sub extract {
        my $string = shift;

        # remove extraneous whitespace to simplify regex
        $string =~ s/(\s){2,}/$1/g;

        # add double spaces to both ends of string to make it easier to
        # find phone numbers at beginning and end of strings
        $string = '  ' . $string . '  ';

        # find patterns in the string that look like phone numbers
        my @matches = $string =~ /
            # Look for ten digit North American numbers
            (?<=\D\s)              # Positive look-behind assertion to avoid false positives
            (?:1(?:\.|\s|-))?      # optional 1 followed by period, whitespace or dash
            (?<=\s)                # Positive look-behind assertion to avoid false positives
            (?:\(?\d{3}\)          # three digits with optional opening and closing paren
            |
            (?<=\s)                # Positive look-behind assertion to avoid false positives
            \d{3})                 # three consecutive digits
            (?:\.|-|\s|\)\s)?      # optional punctuation (period, dash, whitespace)
            \d{3}                  # three consecutive digits
            (?:\.|-|\s)?           # optional punctuation
            \d{4}                  # 4 consecutive digits
            (?=\s\D)               # Positive look-ahead assertion to avoid false positives
        |
            # Look for seven digit North American numbers
            (?<=\D\s)              # Positive look-behind assertion to avoid false positives
            \d{3}                  # three consecutive digits
            (?:\.|-|\s)?           # optional punctuation
            \d{4}                  # 4 consecutive digits
            (?=\s\D)               # Positive look-ahead assertion to avoid false positives
            /gx;

        # get rid of matches with new lines
        @matches = grep { index($_, "\n") == -1 } @matches;

        say "Match: '$_'" for @matches;
    }

This outputs six phone numbers which I think any human would agree are in the original input:

    Match: '413-577-1234'
    Match: '413.233.2343'
    Match: '562-3113'
    Match: '401 311 7898'
    Match: '1 (413) 555-2378'
    Match: '4135552378'