Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a file of oh..hundreds of phone numbers and they're not in any order. The file also has a bunch of useless information such as names, addresses, email addresses, etc. I need a regex that will pull phone numbers of different formats.

For example:

###.###.####
(###) ###-####
(###)###-####
###.####
###-####
......

So I guess I'm looking for a regex that will pull back anything that looks like a phone number and print it in ###-###-#### format. I need it set up to only store unique numbers in case my file has identical numbers, but I can figure that out.

open(FILE "file.txt") or die "oops $!";
my @data = <FILE>;
close FILE;

my @list;
foreach(@data) {
    if ($_ =~ ## regexes here) {
        @list = "$_";
    }
}

open (SAVED, "> saved.txt") or die "oops $!";
foreach(@list){
    print "$_\n";
}

As you can see, I can do most of it but I can't figure out the regex on something with multiple possibilities. I have like 5+ different ways a phone number can look and I need to find all of them.

Replies are listed 'Best First'.
Re: phone number regex
by arden (Curate) on Mar 10, 2004 at 04:51 UTC
    Why re-invent the wheel. . . look at Number::Phone::US for US phone numbers (which it looks like you want).

    - - arden.

      Unless you don't mind matching a whole load of numbers that either aren't in the US or are impossible, such as 876 444 4444 (Jamaica) and 000 000 0000 (impossible), please don't.
Re: phone number regex
by Old_Gray_Bear (Bishop) on Mar 10, 2004 at 05:54 UTC
    Don't suppose you have any non-US numbers in there, the 3-3-4 pattern is not universal. My Company Phone book application has to cope with 2-5-3 and 2-3-5 as well. And then there is adding the international prefix....

    ----
    I Go Back to Sleep, Now.

    OGB

      And even in the US, not every number follows the 3-3-4 pattern. Local numbers are often given as a 3-4 pattern. Non local numbers may, or may not, have a '1-' prefix. Some numbers are given using letters (1-800-RUN-PERL). And then there are numbers like '911', which don't fit in.

      Abigail
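
      A minimal sketch of how the '1-' prefix and lettered numbers mentioned above could be folded into the 3-3-4 form. The helper name normalize_vanity is made up for illustration, the sketch assumes the standard US keypad letter layout, and it deliberately returns nothing for short codes such as 911.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): accept a US number that may
# carry a leading "1-" and keypad letters, return it as ###-###-####.
sub normalize_vanity {
    my ($raw) = @_;
    # Map keypad letters to digits: ABC=2 DEF=3 GHI=4 JKL=5 MNO=6 PQRS=7 TUV=8 WXYZ=9
    (my $n = uc $raw) =~ tr/A-Z/22233344455566677778889999/;
    $n =~ tr/0-9//cd;               # drop everything that is not a digit
    $n =~ s/^1(?=\d{10}$)//;        # drop the long-distance prefix, if present
    return unless length($n) == 10; # 7-digit and short codes are left to the caller
    return join '-', substr($n, 0, 3), substr($n, 3, 3), substr($n, 6);
}

print normalize_vanity('1-800-RUN-PERL'), "\n";
```

      Local 3-4 numbers would need a default area code from somewhere, which is why this sketch refuses them rather than guessing.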

Re: phone number regex
by ysth (Canon) on Mar 10, 2004 at 05:01 UTC
    First of all, you don't want to assign to @list in your loop, since that will get rid of anything put in it from a previous line of the file. push @list, morestuff instead.

    It sounds like you might have more than one number per line, so you'd want to use the //g flag to get more than one match. Your regex should capture the whole numbers with | to alternate between the different patterns, and strip out the extra characters afterward. Then format it as xxx-xxxx or xxx-xxx-xxxx depending on how long it is.
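
    The advice above could be sketched roughly as follows: one alternation, matched with //g so several numbers per line are found, then punctuation stripped and the digits reformatted by length. The names $phone_re and extract_phones, and the sample lines, are assumptions for illustration; the pattern only covers the formats listed in the question.

```perl
use strict;
use warnings;

# Assumed pattern: only the formats listed in the question.
my $phone_re = qr{
    (?<!\d)                          # not in the middle of a longer digit run
    (
        \(\d{3}\)\s*\d{3}-\d{4}      # (###) ###-#### or (###)###-####
      | \d{3}[.-]\d{3}[.-]\d{4}      # ###.###.#### or ###-###-####
      | \d{3}[.-]\d{4}               # ###.#### or ###-####
    )
    (?!\d)
}x;

# Hypothetical helper: every phone-like number on one line, normalized.
sub extract_phones {
    my ($line) = @_;
    my @numbers;
    while ($line =~ /$phone_re/g) {
        (my $digits = $1) =~ tr/0-9//cd;     # strip the extra characters
        if (length($digits) == 10) {
            push @numbers, join '-', substr($digits, 0, 3), substr($digits, 3, 3), substr($digits, 6);
        }
        else {
            push @numbers, join '-', substr($digits, 0, 3), substr($digits, 3);
        }
    }
    return @numbers;
}

# Made-up sample lines; in the real script these would come from the file.
my @sample = (
    'Call 440.582.7119 or (440)526-2900; local 888-2300',
    '1340 S. 123rd St Apt. 310',
);

my %seen;                                    # hash keys keep the numbers unique
for my $line (@sample) {
    $seen{$_}++ for extract_phones($line);
}
print "$_\n" for sort keys %seen;
```

    Pushing onto the list (or counting in a hash, as here) instead of assigning to @list is what keeps matches from earlier lines.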

Re: phone number regex
by coec (Chaplain) on Mar 10, 2004 at 05:21 UTC
    It's Ugly But It Works(tm)

    #!/usr/bin/perl -w
    while (<DATA>) {
        $num = join "-", split /\D+/;
        $num =~ s/^-//;
        print "$num\n";
    }
    __DATA__
    (01)1234 5678
    (01) 1234 5678
    01-1234-5678
    01 1234 5678
    01 234 4567

    Updated
    Not so ugly now...
Re: phone number regex
by Mr. Muskrat (Canon) on Mar 10, 2004 at 05:06 UTC

    It's not entirely a regex solution but perhaps something like this could be used?

    my %seen;
    open(FILE, '<', 'file.txt') or die "Unable to open file.txt for reading, $!";
    while (<FILE>) {
        chomp;
        tr/0-9//cd;
        $_ = sprintf("%010s", $_);
        $_ =~ s/(\d{3})(\d{3})(\d{4})/$1-$2-$3/;
        $seen{$_}++;
    }
    close(FILE);
    open(SAVED, '>', 'saved.txt') or die "Unable to open saved.txt for writing, $!";
    print SAVED "$_\n" for (sort keys %seen);
    close(SAVED);

    This will strip all non-digits and pad the left side of the number with zeros if it's less than 10 digits long.
    Updated to add the dashes (how'd I miss that?)

      Not knowing what the input file really looks like (but hearing from the OP that it contains "lots of other junk", like addresses, email, etc), I would tend not to trust this sort of approach. What if some lines have multiple numeric fields, one of which is a phone number? What about a line like "1340 S. 123rd St Apt. 310"? (After deleting all the non-digits, you get something that looks like a phone number.) And so on.

        Details!

        Okay, you caught me napping at the keyboard. I didn't filter the input.

      That works great!! A few questions on this. Every time I run this, I get the number 000-000-0000 on the top of my list even though it's not in my file. Any idea why?

      Also, can you explain what the sprintf line is doing? I know sprintf does something with numbers but I don't see what this is actually doing. This matches MORE than 10 characters: if you find numbers like 123.231.343.343 it would match the ending 343 as well, so it's not telling it how many numbers to match.

      Thanks.

        It's finding a line without any numbers in it. Like I said in my reply to graff, I didn't check the input to see if it matched a phone number.

        The sprintf is creating a string that is a minimum of ten digits long and that is zero padded if the number is less than ten digits. "123.231.343.343" would become "123-231-343343" in the end.
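
        To see the effect described here, the strip, pad and dash steps can be run on a few single lines. This is only a sketch: the wrapper name strip_pad_dash and the first two sample inputs are made up, while the last two inputs are taken from the replies in this thread.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the three steps from the snippet in this thread.
sub strip_pad_dash {
    my ($line) = @_;
    $line =~ tr/0-9//cd;                          # delete every non-digit
    $line = sprintf("%010s", $line);              # left-pad with zeros to at least 10 characters
    $line =~ s/(\d{3})(\d{3})(\d{4})/$1-$2-$3/;   # dash only the first ten digits
    return $line;
}

for my $input ('no digits here', '555-1234', '123.231.343.343', '1340 S. 123rd St Apt. 310') {
    printf "%-28s => %s\n", $input, strip_pad_dash($input);
}
```

        Run as-is, the four inputs should come out as 000-000-0000, 000-555-1234, 123-231-343343 and 134-012-3310. The first is the mystery entry at the top of the list (a line with no digits at all), and the last shows a street address turning into something that looks like a phone number, which is why the input needs filtering before this step.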

      I tried your code but for some reason it doesn't always work. Here is a snippet of my junk code I need to parse.
      --------------------------------------------------------------------------------
      Residential MLS #: 2122894 Status: Active-NORMLS LP: $148,500 SP: $
      3850 Silsby Rd* University Heights* OH* 44118*-3102* Unit/Lot #: * Area: 1303 Unit Floor #: Map Coordinate: C9D3
      Subdivision/Complex: * Photos: Media: 5
      Acres: 0.13 1/2 Yr. Tax : 1259 County: Cuyahoga*
      Owner/Agent: No Parcel ID# (PIN): 722-15-088*
      Year Built: 1940* Lot Dimensions: 40x140
      School District: 1810/Cleveland Hts-Univ Hts City List Type: ERS Irregular: N
      High School: MLS Cross Ref #:
      Sub Property Type: One Family
      List Date: 1/4/2004 MT: 55
      Directions: Between S.Taylor & Warrensville Center Rds.,south of Cedar Rd.
      # Rooms: 6 # Bedrooms: 3* Total Baths: 1.1 Finished SqFt: 1209*
      LO #/Name: 2710 / Realty One (440) 526-2900 Office Web Site: www.realtyone.com
      LA #/Name: 450365 / Cindy Czepczynski (440) 582-7119 LA Email: c.czepczynski@realtyone.com
      LA 2 #/Name: / LA 2 Email:
      SAC: 0 BAC: 3 OAC: *Graduated LockBox Desc: Combination
      Compensation Explain: 3% on %100,000 plus 2% on remainder. Fixer Upper: N
      Remarks: FRESH,CLEAN,UPDATED FROM TOP TO BOTTOM! ROOF'02,FURNACE'03,WINDOWS'01,BSMT WATERPROOFED'02,REMODELED
      KITCHEN'01,REMODELED FBA'03,REMODELED HBA'02,MANY MORE UPDATES! LOCATED IN DESIRABLE AREA. COZY WBFP IN FAM RM,WALK-UP ATTIC
      FOR STORAGE,REC/PLAY AREA IN BSMT,FRESHLY PAINTED THROUGHOUT. MOVE RIGHT IN!!!
      Broker Remarks: SUBJECT TO SELLERS FINDING HOME OF CHOICE.
      --------------------------------------------------------------------------------
      Residential MLS #: 2121062 Status: Active-NORMLS LP: $148,500 SP: $
      10761 MEADOWBROOK PARMA HEIGHTS OH 44130- Unit/Lot #: Area: 402 Unit Floor #: Map Coordinate: C22B3
      Subdivision/Complex: Photos: Media: 6
      Acres: 1/2 Yr. Tax : 1271 County: Cuyahoga
      Owner/Agent: Parcel ID# (PIN): 47412006
      Year Built: 1965 Lot Dimensions: 50x331
      School District: 1824/Parma City List Type: ERS Irregular: N
      High School: MLS Cross Ref #:
      Sub Property Type: One Family
      List Date: 12/13/2003 MT: 77
      Directions: OFF YORK ROAD
      # Rooms: # Bedrooms: 3 Total Baths: 2 Finished SqFt: 1192
      LO #/Name: 2269 / Prudential Farina 1st American (440) 888-2300 Office Web Site:
      LA #/Name: 332633 / Janice Burton (440) 886-5941 LA Email: jburtonc21@aol.com
      LA 2 #/Name: / LA 2 Email:
      SAC: 3 BAC: 3 OAC: *Graduated, Dual LockBox Desc:
      Compensation Explain: 3% OF $100,000 AND 2 1/2% REMAINDER Fixer Upper: N
      Remarks: ALL BRICK, MOVE-IN CONDITION! EAT-IN KITCHEN WITH CERAMIC TILE INCLUDES ALL APPLIANCES!FORMAL DINING ROOM! NEWER
      WINDOWS! SHARP FINISHED BASEMENT W/DECORATIVE FIREPLACE AND BAR! EXERCISE ROOM AND LAUNDRY ROOM!VERY PRIVATE, LARGE LOT BACKS TO
      TRI-C! HOME WARRANTY! BSMT WATERPROOFED W/WARRANTY! WON'T LAST!
      Broker Remarks:
      --------------------------------------------------------------------------------
      Residential MLS #: 2114484 Status: Active-NORMLS LP: $148,500 SP: $
      33964 Morning Glory Ln North Ridgeville Oh 44039- Unit/Lot #: Area: 505 Unit Floor #: 33964 Map Coordinate: L07D2
      Subdivision/Complex: Wildflower Photos: Media: 6
      Acres: 1/2 Yr. Tax : 1114 County: Lorain
      Owner/Agent: Parcel ID# (PIN): 07 00 008 704 008
      Year Built: 1997 Lot Dimensions:
      School District: 4711/North Ridgeville City List Type: ERS Irregular: N
      High School: N RIDGEVILLE MLS Cross Ref #:
      Sub Property Type: Condominium
      List Date: 10/21/2003 MT: 130
      Directions: CENTER RIDGE NORTH ON WILDFLOWER LEFT ON MORNING GLORY
      # Rooms: 5 # Bedrooms: 2 Total Baths: 2 Finished SqFt: 1372
      LO #/Name: 2802 / Smythe, Cramer Co. (440) 888-5353 Office Web Site: www.smythecramer.com
      LA #/Name: 438395 / Robert Miller (440) 979-5783 LA Email: rmiller@smythecramer.com
      LA 2 #/Name: / LA 2 Email:
      SAC: 0 BAC: 3 OAC: *Graduated LockBox Desc:
      Compensation Explain: 3%/1ST $100k,2% REMAINDER Fixer Upper: N
      Remarks: ELEGANT ATTACHED RANCH CONDO! CORIAN COUNTERS CERAM TILE & NEWER CARPETING, VAULTED & 9`CEILINGS & A FRESH, NEUTRAL
      DECOR. OWNERS SUITE W/FBA & WALK-IN. THE CONVENIENCE OF A COVERED ENTRY, 2CAR ATT & LAUNDRY ROOM. THE EXTERIOR HAS BEEN ENHANCED BY A
      RAISED & FENCED DECK & GARDENS. YOU`LL LOVE IT!
      Broker Remarks: NO SHOWINGS BEFORE NOON
      --------------------------------------------------------------------------------
      Prepared by: Mary Ann Zahand / (440) 878-6296 Information is Believed To Be Accurate But Not Guaranteed Date Printed: Fri, Feb 27, 2004
      --------------------------------------------------------------------------------
      Residential MLS #: 2113958 Status: Active-NORMLS LP: $148,500 SP: $
      1490 BENNETT RD MADISON OH 44057- Unit/Lot #: Area: 1122 Unit Floor #: Map Coordinate: L2A2
      Subdivision/Complex: Photos: Media: 7
      Acres: 0.58 1/2 Yr. Tax : 1130 County: Lake
      Owner/Agent: Parcel ID# (PIN): 01B102000034
      Year Built: 1989 Lot Dimensions: 90X280
      School District: 4303/Madison Local List Type: ERS Irregular: N
      High School: MADISON MLS Cross Ref #:
      Sub Property Type: One Family
      List Date: 10/17/2003 MT: 134
      Directions: RT 20-N ON BENNETT-N OF MADISON AVE W/S
      # Rooms: 7 # Bedrooms: 3 Total Baths: 2 Finished SqFt: 1600
      LO #/Name: 2832 / Smythe, Cramer Co. (440) 428-1818 Office Web Site: www.smythecramer.com
      LA #/Name: 272843 / Mary Ann Hubbard (440) 223-7653 LA Email: maryannhubbard@alltel.net
      LA 2 #/Name: / LA 2 Email:
      SAC: 0 BAC: 3.0 OAC: *Graduated LockBox Desc: Combination
      Compensation Explain: 3.0% ON 1ST $100,000 & 2& AFTER Fixer Upper: N
      Remarks: YOU'LL BE PLEASANTLY SURPIRSIED!/FRESHLY PAINTED, SOFTLY DECORATED, OPEN & AIRY/DRAMATIC LR W/VAULTED CEILING, FP &
      INDIRECT LIGHTING/MSTR SUITE W/PRIV. BATH & WALK IN CLOSET/SPARKLING WHITE KIT. W/TONS OF CABINETS, ALL APP'L, BREAKFAST AREA/COZY
      DEN/BEAUTIFUL LANDSCAPING, TREED LOT
      Broker Remarks:
      --------------------------------------------------------------------------------
      It skips nearly all phone numbers and picks up nearly everything that's NOT a phone number using your example. Any suggestions?

        I don't generally give fully functional programs out at the drop of a hat. The snippet that I did provide lacks a good input filter, which was pointed out in the replies to it.

Re: phone number regex
by etcshadow (Priest) on Mar 10, 2004 at 06:32 UTC
    Here ya go:
    push(@list, "($1) $2-$3") if /\(?(\d{3})\)?\s*[-.]*\s*(\d{3})\s*[-.]*\s*(\d{4})/;
    ------------
    :Wq
    Not an editor command: Wq
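
    For readers who want to pick the pattern apart, here is the same regex written with /x, wrapped in a loop so that more than one number per line is collected, and printing the ###-###-#### form the question asked for. This is a sketch: the names $re and phones_in are made up, and switching to //g and to dashes is a change from the one-liner, not part of it.

```perl
use strict;
use warnings;

# The one-liner's pattern, spread out with /x for commenting.
my $re = qr{
    \(? (\d{3}) \)?      # area code, parentheses optional
    \s* [-.]* \s*        # optional dash or dot, optional spaces
    (\d{3})              # exchange
    \s* [-.]* \s*
    (\d{4})              # last four digits
}x;

# Hypothetical helper: all 10-digit numbers found in a string.
sub phones_in {
    my ($text) = @_;
    my @out;
    while ($text =~ /$re/g) {
        push @out, "$1-$2-$3";
    }
    return @out;
}

print "$_\n" for phones_in('LO #/Name: 2710 / Realty One (440) 526-2900 Office Web Site:');
```

    Two things to keep in mind with this pattern: it needs all ten digits, so bare ###-#### numbers are skipped, and it is not anchored, so a long enough run of digits elsewhere in a line can also match.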