nysus has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a regex to extract phone numbers from a string.

use strict;
use warnings;
use 5.010;

my $string = "Example 1: 413-577-1234
Example 2: 981-413-777-8888
Example 3: 413.233.2343
Example 4: 562-3113
Example 5: 401 311 7898
Example 6: 2342343-23878-878-2343
Example 7: 1 (413) 555-2378
Example 8: 4135552378
Example 9: 413 789 8798 343 9878";

extract($string);

sub extract {
    my %results = ();
    my $string = shift;

    # pad string with spaces to make it easier to find phone numbers at
    # beginning and end of strings
    $string = ' ' . $string . ' ';

    # get rid of consecutive whitespace characters to make regex easier
    # and faster
    $string =~ s/(\s){2,}/$1/g;

    # find patterns in the string that look like phone numbers
    my @matches = $string =~ /
        # Look for ten digit North American numbers
        (?:1(?:\.|\s|-))*       # optional 1 followed by period OR whitespace or dash
        \(?                     # optional opening paren
        \d{3}                   # three consecutive digits
        (?:\.|\)|-|\s|\)\s)?    # optional punctuation (period, close paren, dash, whitespace)
        \d{3}                   # three consecutive digits
        (?:\.|-|\s)?            # optional punctuation
        \d{4}                   # 4 consecutive digits
        |
        # Look for seven digit North American numbers
        \d{3}                   # three consecutive digits
        (?:\.|-|\s)?            # optional punctuation
        \d{4}                   # 4 consecutive digits
    /gx;

    say "Match: '$_'" for @matches;
}

This outputs:

Match: '413-577-1234'
Match: '1-413-777-8888'
Match: '413.233.2343'
Match: '562-3113'
Match: '401 311 7898'
Match: '2342343'
Match: '878-878-2343'
Match: '1 (413) 555-2378'
Match: '4135552378'
Match: '413 789 8798'
Match: '343 9878'

As you can see, my regex does not handle examples #2, #6 and #9 properly. When there are long strings of digits, it finds phone numbers where there are none. How can I modify the regex to avoid these false positives? I've tried a few different ways but none of them succeeded.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate";
$nysus = $PM . ' ' . $MCF;
Click here if you love Perl Monks

Replies are listed 'Best First'.
Re: Regex for extracting phone numbers from string
by afoken (Chancellor) on Apr 11, 2016 at 06:02 UTC
    I'm trying to write a regex to extract phone numbers from a string. [...] As you can see, my regex does not handle examples #2, #5 and #9 properly. When there are long strings of digits, it finds phone numbers where there are none. How can I modify the regex to avoid these false positives? I've tried a few different ways but none of them succeeded.

    Are you aware that people write phone numbers in very different ways around the world? Some examples:

    In Germany, "(0 12 34) 5 67 89 - 11" is a very common way to write a business phone number; "0 12 34 / 5 67 89 - 11" is just as common. "(0 98 76) 54 32 10 98" may be a private phone number (no extension after "-"); it could also be written as "0 98 76 / 54 32 10 98". In a business context, "+49 (0) 12 34 / 5 67 89 - 11" is also used. International callers would dial the international call prefix, then 49 for Germany, then 12345678911 to reach the contact; German callers would ignore the "+49" and dial 012345678911. Spacing varies wildly, ignoring the rule that digits should be written in pairs. People trying to stand out from the masses add dots, underscores or some other line noise to their phone number. Others just omit all whitespace: +49(0)1234/56789-11.
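The "+49 (0)" convention described above can be sketched in a few lines of Perl. This is only an illustration of that one paragraph: the sub name `german_dial_strings` is hypothetical, and prepending a 0 when the written form lacks one is an assumption, not something stated in the post.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a German number as written into the digit
# strings a domestic caller and an international caller would dial.
sub german_dial_strings {
    my ($written) = @_;
    (my $n = $written) =~ s/^\s*\+49\s*//;    # domestic callers ignore "+49"
    $n =~ s/\D//g;                             # drop spaces, parens, "/", "-"
    $n = "0$n" unless $n =~ /^0/;              # assumption: national form starts with 0
    my $national      = $n;
    my $international = '49' . substr($n, 1);  # country code replaces the leading 0
    return ($national, $international);
}

my ($national, $international) =
    german_dial_strings('+49 (0) 12 34 / 5 67 89 - 11');
print "national: $national\n";
print "after the international call prefix: $international\n";
```

The international call prefix itself (00, 011, 0011, and so on) is deliberately left out, since it depends on where the caller is.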

    Speaking of international call prefixes (the "+" in "+49"): Germany and many other countries use 00, the USA uses 011, Australia uses 0011, Cuba uses 119, some former Soviet republics use 810, and so on. In GSM networks, you can directly dial "+".

    Vanity numbers seem to be quite popular in some areas of the world. "555-SHOE-SHOP" may be a valid phone number in the US, "(0700) PERLMONK" may be a valid phone number in Germany. Because the masses still do not understand vanity numbers, the latter is usually also written out in digits: "(0700) 73756665".

    000, 110, 112, 117, 118, 144, 911, 999 are unusually short, but still valid phone numbers, all for emergency use in various areas of the world. Not all emergency numbers work everywhere. Germany only uses 110 and 112. 117 and 118 are used in Switzerland, 911 in the USA, 999 and 112 in the UK, 000 in Australia, 144 in Switzerland and Austria. France also uses the two-digit numbers 15, 17, and 18 as emergency call numbers.

    Don't assume three digits = emergency use. While most phone numbers in Germany have at least four, usually six to eight digits, my brother has an ancient three-digit phone number, recently migrated to VoIP. As long as those ancient numbers are in use, they can not be revoked or changed. So some people still have and use those exceptionally short numbers in Germany.

    See also Telephone_number, National_conventions_for_writing_telephone_numbers, Re: Trim Phone Numbers.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      As yet another variation, 30 years ago, I lived in a rural area of the US where the population was small enough that, to make calls within the local exchange, you only had to dial the last four digits of the number. I assume that sort of thing is still around.

      And there are obviously rules about which digits can be used and in what sequence, since that's the only way the phone switching system can determine when you've dialed the complete number, so not even every 7- or 10-digit number is a potential phone number. For example, 123-4567 can't be a valid phone number in the US because the leading '1' indicates that you're calling a non-local number.
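The digit rules mentioned above can be checked after a match. Under the North American Numbering Plan, the area code and the exchange code each begin with a digit from 2 to 9, which is why 123-4567 fails. A minimal sketch, assuming a hypothetical helper named `plausible_nanp`; it tests only those leading digits and the overall length, not whether a number is actually assigned.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the digits in $number could form a 7- or
# 10-digit North American number, optionally prefixed by a long-distance 1.
sub plausible_nanp {
    my ($number) = @_;
    (my $digits = $number) =~ s/\D//g;     # keep digits only
    $digits =~ s/^1(?=\d{10}\z)//;         # drop a leading 1 before 10 digits
    # optional area code, then exchange + line number; both start with 2-9
    return $digits =~ /^(?:[2-9]\d{2})?[2-9]\d{6}\z/ ? 1 : 0;
}

printf "%-18s %s\n", $_, plausible_nanp($_) ? 'plausible' : 'rejected'
    for '123-4567', '562-3113', '1 (413) 555-2378', '013-555-2378';
```

Running such a check on each regex match keeps the pattern itself simple while still discarding digit strings that cannot be dialed.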

      I'm writing this script for my own use to pull phone numbers from text that will only have local seven-digit numbers or North American phone numbers.

      $PM = "Perl Monk's";
      $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate";
      $nysus = $PM . ' ' . $MCF;

Re: Regex for extracting phone numbers from string
by Discipulus (Canon) on Apr 11, 2016 at 07:10 UTC
Re: Regex for extracting phone numbers from string
by Corion (Patriarch) on Apr 11, 2016 at 07:32 UTC
Re: Regex for extracting phone numbers from string
by AnomalousMonk (Archbishop) on Apr 11, 2016 at 05:58 UTC
Re: Regex for extracting phone numbers from string
by GrandFather (Saint) on Apr 11, 2016 at 04:26 UTC
    "As you can see, ..."

    No, I can't see. #2 and #9 are rejected and #5 is accepted. This is consistent with the comments in your code.

    Assuming I need it spelled out, what is the problem you see?

    Premature optimization is the root of all job security

      #2 is not rejected. It shows as a match: 1-413-777-8888

      #5 was a typo. I meant #6

      #9 results in two different matches.

      $PM = "Perl Monk's";
      $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate";
      $nysus = $PM . ' ' . $MCF;

        Anchoring the ends of the match to white space fixes 2 and 6. Harder to fix 9 without knowing what else may be in the string and thus how much "buffer" is needed. However 9 could be fixed in a second test for overall length of a normalized string. Indeed, the whole matching process becomes easier by using a first pass to extract candidate numbers then a second pass to reject numbers that aren't the right length.

        use strict;
        use warnings;
        use 5.010;

        my $string = <<STR;
        Example 1: 413-577-1234
        Example 2: 981-413-777-8888
        Example 3: 413.233.2343
        Example 4: 562-3113
        Example 5: 401 311 7898
        Example 6: 2342343-23878-878-2343
        Example 7: 1 (413) 555-2378
        Example 8: 4135552378
        Example 9: 413 789 8798 343 9878
        STR

        extract($string);

        sub extract {
            my %results = ();
            my $string = shift;

            # pad string with spaces to make it easier to find phone numbers at
            # beginning and end of strings
            $string = ' ' . $string . ' ';

            # get rid of consecutive whitespace characters to make regex easier
            # and faster
            $string =~ s/(\s){2,}/$1/g;

            # find patterns in the string that look like phone numbers
            my @matches = $string =~ /
                (Example\s\d+:\s)
                # Anchor the left end of the phone number
                (?<=\s)
                # Look for ten digit North American numbers
                (
                    (?: 1(?:\.|\s|-))*      # optional 1 followed by period OR whitespace or dash
                    \(?                     # optional opening paren
                    \d{3}                   # three consecutive digits
                    (?:\.|\)|-|\s|\)\s)?    # optional punctuation (period, close paren, dash, whitespace)
                    \d{3}                   # three consecutive digits
                    (?:\.|-|\s)?            # optional punctuation
                    \d{4}                   # 4 consecutive digits
                    |
                    # Look for seven digit North American numbers
                    \d{3}                   # three consecutive digits
                    (?:\.|-|\s)?            # optional punctuation
                    \d{4}                   # 4 consecutive digits
                )
                # Anchor the right end of the phone number
                (?=\s)
            /gx;

            say shift @matches, "match: '", shift @matches, "'" while @matches;
        }

        Prints:

        Example 1: match: '413-577-1234'
        Example 3: match: '413.233.2343'
        Example 4: match: '562-3113'
        Example 5: match: '401 311 7898'
        Example 7: match: '1 (413) 555-2378'
        Example 8: match: '4135552378'
        Example 9: match: '413 789 8798'
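The two-pass idea mentioned above (extract candidates first, then reject those whose normalized length is wrong) could look roughly like the following. This is a separate sketch, not the code posted above: the sub name `extract_two_pass` is hypothetical, and the accepted lengths (7 digits, 10 digits, or 11 digits starting with 1) are an assumption based on the formats in the question.

```perl
use strict;
use warnings;

# Hypothetical two-pass extractor.
sub extract_two_pass {
    my ($text) = @_;

    # Pass 1: every maximal run of digits, spaces, and phone punctuation.
    my @candidates = $text =~ /([-\d.()\x20]+)/g;

    # Pass 2: normalize each run to bare digits and keep only plausible lengths.
    my @numbers;
    for my $candidate (@candidates) {
        (my $digits = $candidate) =~ s/\D//g;
        next unless $digits =~ /^(?:\d{7}|1?\d{10})\z/;
        $candidate =~ s/^\s+|\s+$//g;    # trim surrounding spaces
        push @numbers, $candidate;
    }
    return @numbers;
}

my $sample = "Example 1: 413-577-1234\nExample 2: 981-413-777-8888\n"
           . "Example 4: 562-3113\nExample 9: 413 789 8798 343 9878\n";
print "Match: '$_'\n" for extract_two_pass($sample);
```

Because the first pass takes whole runs, two numbers separated only by a space are treated as one over-long candidate and dropped, which is how example 9 gets rejected without extra anchoring.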
        Premature optimization is the root of all job security
Re: Regex for extracting phone numbers from string
by Laurent_R (Canon) on Apr 11, 2016 at 06:36 UTC
    There are zillions of ways to recognize a telephone number, but none of them is universal. It all depends on whether you are willing to accept false positive or false negative matches, and how much you know about the various formatting possibilities.

    One possible try is something like this:

    $num = $1 if $string =~ /(\d[\d.()\s-]{5,18})/;
    i.e. a leading digit followed by 5 to 18 digits, periods, spaces, parens, or dashes.
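To see the false-positive trade-off with such a loose pattern, it can be run in list context over a sample string. A minimal sketch, assuming the quantifier is meant as `{5,18}` and that the match should be captured; the sub name `loose_candidates` and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect everything the loose pattern matches.
sub loose_candidates {
    my ($text) = @_;
    my @found = $text =~ /(\d[\d.()\s-]{5,18})/g;
    s/[^\d)]+\z// for @found;    # trim trailing spaces and punctuation
    return @found;
}

my $text = 'Call 413-577-1234 or 562-3113 before 2016-04-11.';
print "Candidate: '$_'\n" for loose_candidates($text);
```

Both phone numbers are found, but so is the date, so a pattern this permissive needs a later filtering step if false positives matter.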
Re: Regex for extracting phone numbers from string
by nysus (Parson) on Apr 11, 2016 at 14:43 UTC

    OK, after sleeping on it I decided to experiment more with positive look-ahead and look-behind assertions. They seem to have done the trick. Here is the new code with new examples:

    use strict;
    use warnings;
    use 5.010;

    my $string = "Example 1: 413-577-1234
    Example 2: 981-413-777-8888
    Example 3: 413.233.2343
    Example 4: 562-3113
    Example 5: 401 311 7898
    Example 6: 55555-55555-555-5555
    Example 7: 1 (413) 555-2378
    Example 7: 1(413)666-2378
    Example 8: 4135552378
    Example 9: 413 789 8798 343 9878
    Example 10: 22222222222222222222";

    extract($string);

    sub extract {
        my $string = shift;

        # remove extraneous whitespace to simplify regex
        $string =~ s/(\s){2,}/$1/g;

        # add double spaces to both ends of string to make it easier to find
        # phone numbers at beginning and end of strings
        $string = '  ' . $string . '  ';

        # find patterns in the string that look like phone numbers
        my @matches = $string =~ /
            # Look for ten digit North American numbers
            (?<=\D\s)             # Positive look-behind assertion to avoid false positives
            (?:1(?:\.|\s|-)*)*    # optional 1 followed by period OR optional whitespace or dash
            (?<=\s)               # Positive look-behind assertion to avoid false positives
            (?:\(?\d{3}\)  |      # optional three digits surrounded by parens
            (?<=\s)               # Positive look-behind assertion to avoid false positives
            \d{3})                # three consecutive digits
            (?:\.|-|\s|\)\s)?     # optional punctuation (period, dash, whitespace)
            \d{3}                 # three consecutive digits
            (?:\.|-|\s)?          # optional punctuation
            \d{4}                 # 4 consecutive digits
            (?=\s\D)              # Positive look-ahead assertion to avoid false positives
            |
            # Look for seven digit North American numbers
            (?<=\D\s)             # Positive look-behind assertion to avoid false positives
            \d{3}                 # three consecutive digits
            (?:\.|-|\s)?          # optional punctuation
            \d{4}                 # 4 consecutive digits
            (?=\s\D)              # Positive look-ahead assertion to avoid false positives
        /gx;

        # get rid of matches with new lines
        @matches = grep { index($_, "\n") == -1 } @matches;

        say "Match: '$_'" for @matches;
    }

    This outputs six phone numbers which I think any human would agree are in the original input:

    Match: '413-577-1234'
    Match: '413.233.2343'
    Match: '562-3113'
    Match: '401 311 7898'
    Match: '1 (413) 555-2378'
    Match: '4135552378'
