Re: phone number parsing refuses to work

Those input files are pretty noisy. If all you need to do is extract and print the phone numbers -- that is, if you don't need to associate each phone number with a name and/or address that's next to it in the data -- then it would help to precondition the text: eliminate all the stuff you know you don't need, so that the potential phone numbers are isolated and easier to pick out.

Perhaps you can take it for granted that a phone number will never be broken up by a line break (that is, a single line either contains one or more complete phone numbers or contains no relevant data at all). You could also take it for granted that all phone numbers use a limited set of punctuation patterns. Under those two assumptions, here is one possible way to handle the preconditioning:

while (<>)  # read one line at a time
{
    s/[a-z;:\@]+//gi;        # letters and these marks never occur in a number
    s/(?<=\d\)) (?=\d)//g;   # remove the space in "\d) \d", e.g. "(555) 123-4567"

    # split the line on whitespace (that's why we got rid of
    # any spaces that might be within a given phone number);
    # for each thing coming out of the split, print it if it
    # looks like a phone number:

    for my $num ( split ' ' ) {   # ' ' also discards leading whitespace
        # the lookarounds keep the match from starting or ending
        # in the middle of a longer run of digits
        next unless ( $num =~ /(?<!\d)(\d{3})\D(\d{3})-(\d{4})(?!\d)/ );
        print "$1-$2-$3\n";
    }
}

That won't be much use if you do have to preserve information about each phone number along with the number itself -- given the nature of the data, that's a slightly trickier problem. (But not too tricky... your data is messy, but there are patterns in it that can guide a more intelligent form of data extraction. You would use the same sort of approach: skip or remove the things that are not relevant, and use simple patterns to isolate the things that are.)
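
For what it's worth, here is a rough sketch of what that could look like. It is only a guess at your layout: it assumes that whatever text precedes a phone number on the same line is the information that belongs to it, which your real data may or may not honor. The sub name extract_records and the sample lines are made up for illustration; swap the sample loop for a while (<>) loop to run it on your files.

```perl
use strict;
use warnings;

# Hypothetical helper: return [label, number] pairs for one line of input.
# ASSUMPTION: the text between the previous number (or the start of the
# line) and this number is the label that goes with this number.
sub extract_records {
    my ($line) = @_;
    my @records;
    my $last = 0;    # offset where the previous match ended

    # Accepts "(555) 123-4567", "(555)123-4567" and "555-123-4567"
    while ( $line =~ /(?<!\d)\(?(\d{3})\)?[-. ]?(\d{3})-(\d{4})(?!\d)/g ) {
        # Save everything from this match before running another regex,
        # because s/// below resets $1, @- and @+.
        my $number = "$1-$2-$3";
        my ( $start, $end ) = ( $-[0], $+[0] );

        my $label = substr( $line, $last, $start - $last );
        $label =~ s/^[\s,;:]+|[\s,;:]+$//g;    # trim separators at both ends

        push @records, [ $label, $number ];
        $last = $end;
    }
    return @records;
}

# Made-up sample lines, standing in for the real input files.
my @sample = (
    'John Smith, (555) 123-4567; fax: 555-987-6543',
    'no phone number on this line',
);

for my $line (@sample) {
    print join( "\t", @$_ ), "\n" for extract_records($line);
}
```

Lines without a number produce no records at all, so the noise is skipped rather than stripped out first. If the related information sits on a neighboring line instead of the same line, you would need to carry some state from one line to the next, but the matching itself stays the same.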
