(german) region code detection - request for thoughts

Skeeve has asked for the wisdom of the Perl Monks concerning the following question:

I want to standardize how phone numbers in "my" database are written. Here in Germany it's usually something like 030 1234567 where 030 is the region code for Berlin and 1234567 is the phone number. So we put a space between region code and phone number.

To find out what the region code ist, I have to use a list of all german region codes and could compare them with the phone number in question.

But there are more than 5000 of them in Germany.

So I created a script that sucks in this list of region codes and spits out a regular expression that will match every region code.

Such a regular expression (for a small subset of those numbers) can look like this: (0(?:2(?:0(?:[1238]|4[13])|1(?:29?))|3(?:0|3(?:0(?:[123467]|5[13456])))|4(?:0|1(?:0[123456789]|3[123456789]))))

I think there are ways how I could optimize the expression produced, like making [123456789] in the example above [1-9].

Or making, instead of 1(?:0[123456789]|3[123456789]) this: [13][1-9]

But: Is it worth the effort?

Do you have any idea how to easily implement such enhancements?

But without any further ado… here is my commented code with the small subset of data used for the example above. If you like the full list, you'll find it here (02…04999) and here (05…09999).

Thanks in advance for your time!

#!/usr/bin/perl
use strict;
use warnings;

my @test_numbers;

my $region;
while (<DATA>) {
    # read all region codes
    # remove the leading zero as it is mandatory
    # and also an indicator for a region code
    next unless /^0(\d+)\s+(.*)/g;
    # get the code
    my $rc= $1;
    # and it's city (the city isn't really used. Could be any string)
    my $city= $2;
    # store each region code for testing later
    push(@test_numbers, "x0${rc}x");
    # Build up the subtree
    $region= build_up($region, $rc, $city);
}

#    testcode
#    use Data::Dumper;
#    print Dumper $region;

# create the regular expression
my $re= "(0" . rc2re($region) . ")";
print "$re\n";

# now check all region codes
foreach (@test_numbers) {
    s/$re/-$1-/o;
    # print $_,"\n";
    die "wrong! $_ " unless /^x-/ and /-x$/;
}
print "success!\n";
exit;

sub build_up {
    my($subtree, $rc, $city)= @_;
    # if the region code is empty
    if ( $rc eq '' ) {
        # we return the city if we don't have a subtree yet
        return $city unless $subtree;
        # if we have a subtree, we already have region codes
        # starting with the same sequence
        if (ref $subtree) {
            # So we add a "new digit" (-) for this city
            $subtree->{'-'}= $city;
            return $subtree;
        }
        # otherwise we had the same region code for a different
        # city (which is legal) and we simply extend the city's name
        return "$subtree, $city"; 
    }
    # If the region code isn't empty we create a new subtee
    # if we don't already have one
    if (not defined $subtree) {
        $subtree={};
    }
    # otherwise we might have a city name
    elsif (not ref $subtree) {
        # this is the same case as above but now
        # we already had a complete region code
        # starting with the same digits
        $subtree= { '-' => $subtree };
    }
    
    # We get the first digit
    my $f= substr($rc,0,1);
    # And build it's subtree.
    $subtree->{$f}= build_up($subtree->{$f}, substr($rc,1), $city);
    
    # done
    return $subtree;
}

sub rc2re {
    my($region)= @_;
    # Collect all end digits
    my $ep= join '', sort grep { $_ ne '-' and not ref $region->{$_} }
+ keys %$region;
    # If it's more than 1 make it a character class
    $ep= "[$ep]" if length $ep>1;
    # If we had the special digit "-", make the end digits optional
    $ep= "$ep?" if $region->{'-'};
    
    # Join all regular expressions for all subtrees
    my $mr= join '|', map { $_ . rc2re($region->{$_}) } sort grep { re
+f $region->{$_} } keys %$region;
    # If we had subtrees
    if (length $mr>0) {
        # join them with the end digits (if any)
        $ep.='|' if $ep ne '';
        $ep= "(?:$ep$mr)";
    }
    
    return $ep;
}

__DATA__
0201    Essen, Ruhr
0202    Wuppertal
0203    Duisburg
02041    Bottrop
02043    Gladbeck Westf
0208    Muelheim a.d. Ruhr
0208    Oberhausen Rheinl
0212    Solingen
02129    Haan Rheinl
030    Berlin
03301    Oranienburg
03302    Hennigsdorf
03303    Birkenwerder
03304    Velten
033051    Nassenheide
033053    Zehlendorf Kr Oberhav
033054    Liebenwalde
033055    Kremmen
033056    Muehlenbeck Kr Oberhav
03306    Gransee
03307    Zehdenick
040    Hamburg
04101    Pinneberg
04102    Ahrensburg
04103    Wedel Holst
04104    Aumuehle b Hamburg
04105    Seevetal
04106    Quickborn Kr Pinneber
04107    Siek Kr Stormarn
04108    Rosengarten Kr Harbur
04109    Tangstedt Bz Hamburg
04131    Lueneburg
04132    Amelinghausen
04133    Wittorf Kr Lueneburg
04134    Embsen Kr Lueneburg
04135    Kirchgellersen
04136    Scharnebeck
04137    Barendorf
04138    Betzendorf Kr LÃ¼nebur
04139    Hohnstorf
[download]

s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e

Comment on (german) region code detection - request for thoughts Select or Download Code

Replies are listed 'Best First'.
Re: (german) region code detection - request for thoughts by betterworld (Curate) on Aug 19, 2008 at 23:09 UTC
Maybe Regexp::Optimizer can help you with this. So we put a space between region code and phone number. ...but some people use "/" or "-" ;-)	[reply]
Re^2: (german) region code detection - request for thoughts by Skeeve (Parson) on Aug 20, 2008 at 06:51 UTC
Maybe some do. But those might not know (or simply ignore) DIN 5008. `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re: (german) region code detection - request for thoughts by Perlbotics (Archbishop) on Aug 20, 2008 at 00:20 UTC
Hi, what you need is the longest prefix match from a list of variable length strings. One possible approach would be to create a hash of hashes of hashes... where at each level one digit of the given number ~~is a key~~ identifies an edge (of a labeled 10-ary tree?). Another idea would be to look-up the prefixes from a simple hash starting with the longest substring from the phone number to check. Something like this (too tired to clean it up, but it shall convey an idea): #!/bin/perl use strict; use warnings; my %tab; foreach (<DATA>) { chomp; $tab{$1}=$2 if /^\s0(\d+)\s+(.)/; # e.g. 30 -> Berlin # you can compute min/max of the prefix here (s. below) } # already normalised w/o leading 0 NUMBER: foreach my $num ( qw(204112345 3304123456 3145666 301234567) ) + { for (my $len=5; $len>=2; $len--) { # assumes max. prefix lenght is 5, min. is 2 my $prefix = substr($num,0,$len); if (defined (my $town=$tab{$prefix})) { printf "0%-12s = %5s-%-7s in sunny %s\n", $num, "0$prefix", substr($num,$len), $town; next NUMBER; } } print "NO MATCH FOR: 0$num\n"; } __DATA__ 0201 Essen, Ruhr 0202 Wuppertal 0203 Duisburg 02041 Bottrop 02043 Gladbeck Westf 0208 Muelheim a.d. Ruhr 0208 Oberhausen Rheinl 212 Solingen 02129 Haan Rheinl 030 Berlin 03301 Oranienburg 03302 Hennigsdorf 03303 Birkenwerder 03304 Velten 033051 Nassenheide 033053 Zehlendorf Kr Oberhav 033054 Liebenwalde 033055 Kremmen 033056 Muehlenbeck Kr Oberhav 03306 Gransee 03307 Zehdenick [download] This prints: `0204112345 = 02041-12345 in sunny Bottrop 03304123456 = 03304-123456 in sunny Velten NO MATCH FOR: 03145666 0301234567 = 030-1234567 in sunny Berlin` [download] Well, 'sunny' is just wishful thinking... Update: ~~removed~~, added	[reply] [d/l] [select]
Re^2: (german) region code detection - request for thoughts by Skeeve (Parson) on Aug 20, 2008 at 06:58 UTC
The tree is what I already have in my code. The hash is something I didn't want. `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re^3: (german) region code detection - request for thoughts by Krambambuli (Curate) on Aug 20, 2008 at 09:54 UTC
Why would you refuse a hash ? There can't be so much regions as to not easily keep them in memory, in a simple hash like `my %prefixes = ( '04025' = [ 'Region1, Region2, Region3' ], ... );` [download] And afterwards, a straightforward check like `my ($pref5, $pref4, $pref3, $pref2) = map { substr( $phone, 0, $_ ) } (5, 4, 3, 2); my $prefix_length = exists $prefixes{$pref5} ? 5 : exists $prefixes{$pref4} ? 4 : exists $prefixes{$pref3} ? 3 : exists $prefixes{$pref2} ? 2 : 0 ; my $formatted_phone = join( ' ', substr( $phone, 0, $prefix_length), substr( $phone, $prefix_length), );` [download] should work rather very effectively. If you have thought about this already, why do you think it would be expensive/ineffective/inadequate ? Krambambuli ---	[reply] [d/l] [select]
Re^4: (german) region code detection - request for thoughts by Skeeve (Parson) on Aug 20, 2008 at 11:13 UTC
Re^5: (german) region code detection - request for thoughts by Krambambuli (Curate) on Aug 20, 2008 at 11:27 UTC
Some notes below your chosen depth have not been shown here
Re^4: (german) region code detection - request for thoughts by Skeeve (Parson) on Aug 20, 2008 at 16:20 UTC
Re^5: (german) region code detection - request for thoughts by Krambambuli (Curate) on Aug 20, 2008 at 17:16 UTC
Re: (german) region code detection - request for thoughts by JavaFan (Canon) on Aug 20, 2008 at 00:54 UTC
`[1-9]` is not an optimization over `[123456789]`. Internally, they are the same, the range form is just there for the benefit of the programmer. As for optimizing it, the best thing you could do is to use perl 5.10. That will have anything you're trying to do by hand build in in the regular expression engine. In 5.10, matching `/04105\|04106\|04107/` is as fast as matching `/0410[5-7]/`.	[reply] [d/l] [select]
Re^2: (german) region code detection - request for thoughts by Anonymous Monk on Aug 20, 2008 at 04:40 UTC
benefit of programmer hands (so they get sore from other things)	[reply]
Re: (german) region code detection - request for thoughts by Krambambuli (Curate) on Aug 20, 2008 at 05:52 UTC
Is `0208 Muelheim a.d. Ruhr 0208 Oberhausen Rheinl` [download] an error or is it indeed possible to have multiple regions mapped to the same phone prefix ? Krambambuli --- fighting the sandals-without-socks dictature	[reply] [d/l]
Re^2: (german) region code detection - request for thoughts by Skeeve (Parson) on Aug 20, 2008 at 06:56 UTC
Yes. That is valid. `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re: (german) region code detection - request for thoughts by jethro (Monsignor) on Aug 20, 2008 at 01:05 UTC
If you can rule out erraneous region codes, you might drop those codes with whatever length has the most occurences (probably 5 or 6) from the list. Any region code you don't find (whether looked up in a hash or checked per regex) must have that length then.	[reply]
Re: (german) region code detection - request for thoughts by JavaFan (Canon) on Aug 20, 2008 at 00:57 UTC
Note that if you just have a list of codes, and a list of numbers you have to map to those codes, your best option would be to use a hash. Put all the codes in the hash, and then with the numbers you are trying to validate, see if the number is in the hash.	[reply]