When I do stuff like this I like to regularize the data first by stripping out the punctuation that makes things more complicated. In most of the US it's not too hard to tell whether something is a phone number: it will generally have 7, 10, or 11 digits (except inside companies' private exchanges and in a few small towns like Volcano Village, HI) plus some form of separators, which depend on where the writer is from and what mood they were in at the time. I included a little twist for extensions, which are usually appended as x\d+, with or without a space before the x.
The example below strips out the punctuation between the digits and then checks the length of the resulting digit runs. If a run is in the 7 to 11 range I declare the string to be a phone number; anything else is treated as part of an address.
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
my @numbers = (
    '(123)456-7890', "222.222.2222", "1-313-345-6798",
    "23-35 Baker St. Apt 6", "666 666 6666", "123-345.5678",
    "45 elm street", "123-345.5678x999", "666 666 6666 x233",
);
foreach my $number (@numbers) {
    my $address = $number;    # keep the original text for the address case
    # strip phone number punctuation between runs of digits:
    $number =~ s/\(?(\d+)[-(). ](\d|x\d)/$1$2/g;
    # a run of 7 to 11 digits (and no more) counts as a phone number
    if ($number =~ m/(?<!\d)\d{7,11}(?!\d)/) {
        # you could regularize phone number formatting in here
        say $number." Phone number";
    } else {
        say $address." Address";
        # process the number as an address
        if ($address =~ m/(\d+)/) {
            say "address number $1";
        }
    }
}
which produces this output:
1234567890 Phone number
2222222222 Phone number
13133456798 Phone number
23-35 Baker St. Apt 6 Address
address number 23
6666666666 Phone number
1233455678 Phone number
45 elm street Address
address number 45
1233455678x999 Phone number
6666666666x233 Phone number
Note that I got lazy and didn't bother pulling out all the numbers within an address string; I also let those numbers be any length rather than just your 3- and 4-digit runs. Numbers like 1-(800)-222-2222 aren't handled properly either (the stacked punctuation is left in place), but that's just a little more regex tweaking. I don't strip commas, since I don't think I've ever seen commas used to punctuate a US phone number. Commas might also be your best flag for lists of apartment numbers. If you're dealing with phone numbers in Europe you're probably doomed: the number of digits seems to vary over a very large range.
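Here is a rough sketch of what that extra tweaking might look like. It is not part of the script above, and the classify() helper is a name I made up for illustration. It assumes an extension only ever appears at the very end of the string, that a phone number contains nothing except digits and the separators - ( ) . and space, and that a phone number has exactly 7, 10, or 11 digits. Anything else is treated as an address, and every digit run in it is returned.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper (not in the original script).
# Returns ('phone', $digits, $extension) or ('address', @digit_runs).
sub classify {
    my ($text) = @_;
    my $candidate = $text;

    # Split off a trailing extension such as "x999" or " x233".
    my $ext = '';
    if ($candidate =~ s/\s*x(\d+)\s*$//) {
        $ext = $1;
    }

    # Only strings made up entirely of digits and phone-style
    # separators are considered; stacked punctuation like "-(" is fine.
    if ($candidate =~ /^[-\d(). ]+$/) {
        (my $digits = $candidate) =~ tr/0-9//cd;    # delete all non-digits
        my $len = length $digits;
        if ($len == 7 || $len == 10 || $len == 11) {
            return ('phone', $digits, $ext);
        }
    }

    # Otherwise treat it as an address and collect every run of digits.
    return ('address', $text =~ /(\d+)/g);
}

for my $text ('1-(800)-222-2222', '123-345.5678x999', '23-35 Baker St. Apt 6') {
    my ($kind, @parts) = classify($text);
    say "$text => $kind: ", join(', ', @parts);
}
```

Because the whole string has to look like a phone number, something like "Apt 5551234" stays an address here, which is stricter than the original script. Whether that's what you want depends on your data.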