OK, after sleeping on it I decided to experiment more with positive look-ahead and look-behind assertions. They seem to have done the trick. Here is the new code with new examples:

use strict;
use warnings;
use feature 'say';

my $string = "Example 1: 413-577-1234
              Example 2: 981-413-777-8888 
              Example 3: 413.233.2343 
              Example 4: 562-3113 
              Example 5: 401 311 7898
              Example 6: 55555-55555-555-5555
              Example 7: 1 (413) 555-2378
              Example 8: 1(413)666-2378
              Example 9: 4135552378
              Example 10: 413 789 8798 343 9878
              Example 11: 22222222222222222222";
extract($string);


sub extract {
  my $string = shift;
  # remove extraneous whitespace to simplify regex
  $string =~ s/(\s){2,}/$1/g;
  # add double spaces to both ends of string to make it easier to find
  # phone numbers at beginning and end of strings
  $string = '  ' . $string . '  ';

  # find patterns in the string that look like phone numbers
  my @matches = $string =~ /


       # Look for ten digit North American numbers
       (?<=\D\s)               # Positive look-behind assertion to avoid false positives
       (?:1(?:\.|\s|-)*)*      # optional leading 1, followed by any periods, whitespace or dashes
       (?<=\s)                 # Positive look-behind assertion to avoid false positives
       (?:\(?\d{3}\) |         # three digits followed by a closing paren (opening paren optional)
       (?<=\s)                 # Positive look-behind assertion to avoid false positives
       \d{3})                  # three consecutive digits
       (?:\.|-|\s|\)\s)?       # optional punctuation (period, dash, whitespace)
       \d{3}                   # three consecutive digits
       (?:\.|-|\s)?            # optional punctuation
       \d{4}                   # 4 consecutive digits
       (?=\s\D)                # Positive look-ahead assertion to avoid false positives

       |                  
       
       # Look for seven digit North American numbers
       
       (?<=\D\s)          # Positive look-behind assertion to avoid false positives
       \d{3}              # three consecutive digits
       (?:\.|-|\s)?       # optional punctuation
       \d{4}              # 4 consecutive digits
       (?=\s\D)           # Positive look-ahead assertion to avoid false positives

/gx;

  # get rid of matches with new lines
  @matches = grep { index($_, "\n") == -1 } @matches;
  say "Match: '$_'" for @matches;
}
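
The two preprocessing steps in extract() can be checked on their own. Here is a minimal sketch; `normalize` is a hypothetical helper name used only for illustration and is not part of the code above:

```perl
use strict;
use warnings;

# Hypothetical helper: the same two preprocessing steps as in extract().
sub normalize {
    my $string = shift;
    # Each run of two or more whitespace characters collapses to the
    # last character of the run, because $1 holds the final repetition
    # of a quantified capture group.
    $string =~ s/(\s){2,}/$1/g;
    # Pad both ends so a number at the very start or end still has
    # "\D\s" before it and "\s\D" after it (a space is a non-digit).
    return '  ' . $string . '  ';
}

my $padded = normalize("562-3113\n      413.233.2343");
print "[$padded]\n";    # [  562-3113 413.233.2343  ]
```

Note that a newline followed by indentation is replaced by a single space here, so the later grep for "\n" probably only matters when a lone newline sits directly between two non-whitespace characters.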

This outputs six phone numbers which I think any human would agree are in the original input:

Match: '413-577-1234'
Match: '413.233.2343'
Match: '562-3113'
Match: '401 311 7898'
Match: '1 (413) 555-2378'
Match: '4135552378'
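
To show why the assertions keep out the longer digit runs, here is a stripped-down sketch that uses only the seven-digit branch of the regex. The helper name `seven_digit_matches` and the two sample strings are made up for illustration:

```perl
use strict;
use warnings;

# Seven-digit branch only, copied from the regex above.
my $seven = qr/
    (?<=\D\s)        # a non-digit and one whitespace character before the number
    \d{3}            # three consecutive digits
    (?:\.|-|\s)?     # optional punctuation
    \d{4}            # 4 consecutive digits
    (?=\s\D)         # one whitespace character and a non-digit after the number
/x;

# Hypothetical helper: pads the input the same way extract() does and
# returns every match when called in list context.
sub seven_digit_matches {
    my $text = '  ' . shift() . '  ';
    return $text =~ /$seven/g;
}

print "'$_'\n" for seven_digit_matches('call 562-3113 today');    # '562-3113'
print "'$_'\n" for seven_digit_matches('id 343 9878 1234');       # prints nothing
```

In the second string, "343 9878" is followed by a space and then another digit, so the look-ahead fails, and every later starting point has a digit two characters back, so the look-behind fails. That is the same reason Example 2 and the long "413 789 8798 343 9878" line produce no matches.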

$PM = "Perl Monk's";
$MCF = "Most Clueless ~~Friar~~ ~~Abbot~~ ~~Bishop~~ ~~Pontiff~~ ~~Deacon~~ Curate";
$nysus = $PM . ' ' . $MCF;

In reply to Re: Regex for extracting phone numbers from string by nysus
in thread Regex for extracting phone numbers from string by nysus