in reply to phone number parsing refuses to work
Perhaps you can take it for granted that a phone number will never be broken up by a line break (a single line contains one or more complete phone numbers, or contains no relevant data at all). You could also take for granted that all phone numbers use a limited set of punctuation patterns. Here is one possible way to handle the preconditioning:
    while (<>)   # read one line at a time
    {
        s/[a-z;:\@]+//gi;          # these aren't used for numbers
        s/(?<=\d\)) (?=\d)//g;     # remove space in "\d) \d"

        # split the line on whitespace (that's why we got rid of
        # any spaces that might be within a given phone number);
        # for each thing coming out of the split, print it if it
        # looks like a phone number:
        for my $num ( split /\s+/ )
        {
            next unless ( $num =~ /\D*(\d{3})\D(\d{3})-(\d{4})\D*/ );
            print "$1-$2-$3\n";
        }
    }

That won't be much use if you do have to preserve information about each phone number along with the number itself -- given the nature of the data, that's a slightly more tricky problem. (But not too tricky... your data is messy, but there are patterns in it that can be used to guide a more intelligent form of data extraction; you use the same sort of approach -- skip or remove things that are not relevant, and use simple patterns to isolate the things that are relevant.)
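For the harder case, where each number has to keep the information that goes with it, here is a minimal sketch of the same "ignore what's irrelevant, match what's relevant" idea. It is only an illustration: I haven't seen your actual data, so the sub name `labeled_numbers` is hypothetical, and it assumes that the word directly in front of a number (something like "Home:" or "Work:") is the information worth keeping. It also keeps the assumption that the last four digits are always preceded by a hyphen.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [label, number] pairs for one line.
# ASSUMPTION (not confirmed by the original data): the word immediately
# before a phone number, with an optional colon, describes that number.
sub labeled_numbers {
    my ($line) = @_;
    my @found;

    # Optional label, then area code (with or without parens), one optional
    # separator, the exchange, a hyphen, and the last four digits.
    while ( $line =~ /(?:([A-Za-z]+):?\s*)?\(?(\d{3})\)?[\s.-]?(\d{3})-(\d{4})/g ) {
        my $label = defined $1 ? $1 : '';   # no label found: use empty string
        push @found, [ $label, "$2-$3-$4" ];
    }
    return @found;
}
```

Called from inside a `while (<>)` loop, each returned pair can be printed or stored in a hash keyed by label. If the relevant information in your data sits somewhere other than right before the number, the optional label part of the pattern is the piece to change.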