Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

I want to standardize how phone numbers in "my" database are written. Here in Germany it's usually something like 030 1234567 where 030 is the region code for Berlin and 1234567 is the phone number. So we put a space between region code and phone number.

To find out what the region code ist, I have to use a list of all german region codes and could compare them with the phone number in question.

But there are more than 5000 of them in Germany.

So I created a script that sucks in this list of region codes and spits out a regular expression that will match every region code.

Such a regular expression (for a small subset of those numbers) can look like this: (0(?:2(?:0(?:[1238]|4[13])|1(?:29?))|3(?:0|3(?:0(?:[123467]|5[13456])))|4(?:0|1(?:0[123456789]|3[123456789]))))

I think there are ways how I could optimize the expression produced, like making [123456789] in the example above [1-9].

Or making, instead of 1(?:0[123456789]|3[123456789]) this: [13][1-9]

But: Is it worth the effort?

Do you have any idea how to easily implement such enhancements?

But without any further ado… here is my commented code with the small subset of data used for the example above. If you like the full list, you'll find it here (02…04999) and here (05…09999).

Thanks in advance for your time!

#!/usr/bin/perl use strict; use warnings; my @test_numbers; my $region; while (<DATA>) { # read all region codes # remove the leading zero as it is mandatory # and also an indicator for a region code next unless /^0(\d+)\s+(.*)/g; # get the code my $rc= $1; # and it's city (the city isn't really used. Could be any string) my $city= $2; # store each region code for testing later push(@test_numbers, "x0${rc}x"); # Build up the subtree $region= build_up($region, $rc, $city); } # testcode # use Data::Dumper; # print Dumper $region; # create the regular expression my $re= "(0" . rc2re($region) . ")"; print "$re\n"; # now check all region codes foreach (@test_numbers) { s/$re/-$1-/o; # print $_,"\n"; die "wrong! $_ " unless /^x-/ and /-x$/; } print "success!\n"; exit; sub build_up { my($subtree, $rc, $city)= @_; # if the region code is empty if ( $rc eq '' ) { # we return the city if we don't have a subtree yet return $city unless $subtree; # if we have a subtree, we already have region codes # starting with the same sequence if (ref $subtree) { # So we add a "new digit" (-) for this city $subtree->{'-'}= $city; return $subtree; } # otherwise we had the same region code for a different # city (which is legal) and we simply extend the city's name return "$subtree, $city"; } # If the region code isn't empty we create a new subtee # if we don't already have one if (not defined $subtree) { $subtree={}; } # otherwise we might have a city name elsif (not ref $subtree) { # this is the same case as above but now # we already had a complete region code # starting with the same digits $subtree= { '-' => $subtree }; } # We get the first digit my $f= substr($rc,0,1); # And build it's subtree. $subtree->{$f}= build_up($subtree->{$f}, substr($rc,1), $city); # done return $subtree; } sub rc2re { my($region)= @_; # Collect all end digits my $ep= join '', sort grep { $_ ne '-' and not ref $region->{$_} } + keys %$region; # If it's more than 1 make it a character class $ep= "[$ep]" if length $ep>1; # If we had the special digit "-", make the end digits optional $ep= "$ep?" if $region->{'-'}; # Join all regular expressions for all subtrees my $mr= join '|', map { $_ . rc2re($region->{$_}) } sort grep { re +f $region->{$_} } keys %$region; # If we had subtrees if (length $mr>0) { # join them with the end digits (if any) $ep.='|' if $ep ne ''; $ep= "(?:$ep$mr)"; } return $ep; } __DATA__ 0201 Essen, Ruhr 0202 Wuppertal 0203 Duisburg 02041 Bottrop 02043 Gladbeck Westf 0208 Muelheim a.d. Ruhr 0208 Oberhausen Rheinl 0212 Solingen 02129 Haan Rheinl 030 Berlin 03301 Oranienburg 03302 Hennigsdorf 03303 Birkenwerder 03304 Velten 033051 Nassenheide 033053 Zehlendorf Kr Oberhav 033054 Liebenwalde 033055 Kremmen 033056 Muehlenbeck Kr Oberhav 03306 Gransee 03307 Zehdenick 040 Hamburg 04101 Pinneberg 04102 Ahrensburg 04103 Wedel Holst 04104 Aumuehle b Hamburg 04105 Seevetal 04106 Quickborn Kr Pinneber 04107 Siek Kr Stormarn 04108 Rosengarten Kr Harbur 04109 Tangstedt Bz Hamburg 04131 Lueneburg 04132 Amelinghausen 04133 Wittorf Kr Lueneburg 04134 Embsen Kr Lueneburg 04135 Kirchgellersen 04136 Scharnebeck 04137 Barendorf 04138 Betzendorf Kr LĂĽnebur 04139 Hohnstorf

s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Replies are listed 'Best First'.
Re: (german) region code detection - request for thoughts
by betterworld (Curate) on Aug 19, 2008 at 23:09 UTC

    Maybe Regexp::Optimizer can help you with this.

    So we put a space between region code and phone number.

    ...but some people use "/" or "-" ;-)

      Maybe some do. But those might not know (or simply ignore) DIN 5008.

      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: (german) region code detection - request for thoughts
by Perlbotics (Archbishop) on Aug 20, 2008 at 00:20 UTC

    Hi, what you need is the longest prefix match from a list of variable length strings. One possible approach would be to create a hash of hashes of hashes... where at each level one digit of the given number is a key identifies an edge (of a labeled 10-ary tree?).
    Another idea would be to look-up the prefixes from a simple hash starting with the longest substring from the phone number to check.

    Something like this (too tired to clean it up, but it shall convey an idea):

    #!/bin/perl use strict; use warnings; my %tab; foreach (<DATA>) { chomp; $tab{$1}=$2 if /^\s*0(\d+)\s+(.*)/; # e.g. 30 -> Berlin # you can compute min/max of the prefix here (s. below) } # already normalised w/o leading 0 NUMBER: foreach my $num ( qw(204112345 3304123456 3145666 301234567) ) + { for (my $len=5; $len>=2; $len--) { # assumes max. prefix lenght is 5, min. is 2 my $prefix = substr($num,0,$len); if (defined (my $town=$tab{$prefix})) { printf "0%-12s = %5s-%-7s in sunny %s\n", $num, "0$prefix", substr($num,$len), $town; next NUMBER; } } print "NO MATCH FOR: 0$num\n"; } __DATA__ 0201 Essen, Ruhr 0202 Wuppertal 0203 Duisburg 02041 Bottrop 02043 Gladbeck Westf 0208 Muelheim a.d. Ruhr 0208 Oberhausen Rheinl 212 Solingen 02129 Haan Rheinl 030 Berlin 03301 Oranienburg 03302 Hennigsdorf 03303 Birkenwerder 03304 Velten 033051 Nassenheide 033053 Zehlendorf Kr Oberhav 033054 Liebenwalde 033055 Kremmen 033056 Muehlenbeck Kr Oberhav 03306 Gransee 03307 Zehdenick

    This prints:

    0204112345 = 02041-12345 in sunny Bottrop 03304123456 = 03304-123456 in sunny Velten NO MATCH FOR: 03145666 0301234567 = 030-1234567 in sunny Berlin
    Well, 'sunny' is just wishful thinking...

    Update: removed, added

      The tree is what I already have in my code.

      The hash is something I didn't want.


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
        Why would you refuse a hash ?

        There can't be so much regions as to not easily keep them in memory, in a simple hash like

        my %prefixes = ( '04025' = [ 'Region1, Region2, Region3' ], ... );

        And afterwards, a straightforward check like
        my ($pref5, $pref4, $pref3, $pref2) = map { substr( $phone, 0, $_ ) } (5, 4, 3, 2); my $prefix_length = exists $prefixes{$pref5} ? 5 : exists $prefixes{$pref4} ? 4 : exists $prefixes{$pref3} ? 3 : exists $prefixes{$pref2} ? 2 : 0 ; my $formatted_phone = join( ' ', substr( $phone, 0, $prefix_length), substr( $phone, $prefix_length), );
        should work rather very effectively. If you have thought about this already, why do you think it would be expensive/ineffective/inadequate ?

        Krambambuli
        ---
Re: (german) region code detection - request for thoughts
by JavaFan (Canon) on Aug 20, 2008 at 00:54 UTC
    [1-9] is not an optimization over [123456789]. Internally, they are the same, the range form is just there for the benefit of the programmer.

    As for optimizing it, the best thing you could do is to use perl 5.10. That will have anything you're trying to do by hand build in in the regular expression engine. In 5.10, matching /04105|04106|04107/ is as fast as matching /0410[5-7]/.

      benefit of programmer hands (so they get sore from other things)
Re: (german) region code detection - request for thoughts
by Krambambuli (Curate) on Aug 20, 2008 at 05:52 UTC
    Is
    0208 Muelheim a.d. Ruhr 0208 Oberhausen Rheinl
    an error or is it indeed possible to have multiple regions mapped to the same phone prefix ?

    Krambambuli
    ---
    fighting the sandals-without-socks dictature

      Yes. That is valid.


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: (german) region code detection - request for thoughts
by jethro (Monsignor) on Aug 20, 2008 at 01:05 UTC
    If you can rule out erraneous region codes, you might drop those codes with whatever length has the most occurences (probably 5 or 6) from the list. Any region code you don't find (whether looked up in a hash or checked per regex) must have that length then.
Re: (german) region code detection - request for thoughts
by JavaFan (Canon) on Aug 20, 2008 at 00:57 UTC
    Note that if you just have a list of codes, and a list of numbers you have to map to those codes, your best option would be to use a hash. Put all the codes in the hash, and then with the numbers you are trying to validate, see if the number is in the hash.