comment on

Those input files are pretty noisy. If all you need to do is extract and print the phone numbers -- that is, if you don't need to associate each phone number with some name and/or address that's next to it in the data -- then it would help to pre-condition the text so as to eliminate all the stuff you know you don't need, and isolate the potential phone numbers to make them easier to pick out.

Perhaps you can take it for granted that a phone number will never be broken up by a line break (a single line contains one or more complete phone numbers, or contains no relevant data at all). You could also take for granted that all phone numbers use a limited set of punctuation patterns. Here is one possible way to handle the preconditioning:

while (<>)  # read one line at a time
{
    s/[a-z;:\@]+//gi;  # these aren't used for numbers
    s/(?<=\d\)) (?=\d)//g;   # remove space in "\d) \d"

# split the line on whitespace (that's why we got rid of
# any spaces that might be within a given phone number);
# for each thing coming out of the split, print it if it
# looks like a phone number:

    for my $num ( split /\s+/ ) {
        next unless ( $num =~ /\D*(\d{3})\D(\d{3})-(\d{4})\D*/ );
        print "$1-$2-$3\n";
    }
}
[download]

That won't be much use if you do have to preserve information about each phone number along with the number itself -- given the nature of the data, that's a slightly more tricky problem. (But not too tricky... your data is messy, but there are patterns in it that can be used to guide a more intelligent form of data extraction; you use the same sort of approach -- skip or remove things that are not relevant, and use simple patterns to isolate the things that are relevant.)

In reply to Re: phone number parsing refuses to work by graff
in thread phone number parsing refuses to work by Anonymous Monk

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.