Re: (german) region code detection - request for thoughts

Hi, what you need is the longest prefix match from a list of variable length strings. One possible approach would be to create a hash of hashes of hashes... where at each level one digit of the given number ~~is a key~~ identifies an edge (of a labeled 10-ary tree?).
Another idea would be to look-up the prefixes from a simple hash starting with the longest substring from the phone number to check.

Something like this (too tired to clean it up, but it shall convey an idea):

#!/bin/perl

use strict;
use warnings;

my %tab;
foreach (<DATA>) {
        chomp;
        $tab{$1}=$2  if /^\s*0(\d+)\s+(.*)/; # e.g. 30 -> Berlin
        # you can compute min/max of the prefix here (s. below)
}

                            # already normalised w/o leading 0
NUMBER: foreach my $num ( qw(204112345 3304123456 3145666 301234567) )
+ {  
            for (my $len=5; $len>=2; $len--) { 
                 # assumes max. prefix lenght is 5, min. is 2
                 my $prefix = substr($num,0,$len);
                 if (defined (my $town=$tab{$prefix})) {
                     printf "0%-12s = %5s-%-7s in sunny %s\n", 
                              $num,  "0$prefix", 
                              substr($num,$len), $town;
                     next NUMBER;
                 }
            }
            print "NO MATCH FOR: 0$num\n";
}

__DATA__
0201    Essen, Ruhr
0202    Wuppertal
0203    Duisburg
02041    Bottrop
02043    Gladbeck Westf
0208    Muelheim a.d. Ruhr
0208    Oberhausen Rheinl
212    Solingen
02129    Haan Rheinl
030    Berlin
03301    Oranienburg
03302    Hennigsdorf
03303    Birkenwerder
03304    Velten
033051    Nassenheide
033053    Zehlendorf Kr Oberhav
033054    Liebenwalde
033055    Kremmen
033056    Muehlenbeck Kr Oberhav
03306    Gransee
03307    Zehdenick
[download]

This prints:

0204112345    = 02041-12345   in sunny Bottrop
03304123456   = 03304-123456  in sunny Velten
NO MATCH FOR: 03145666
0301234567    =   030-1234567 in sunny Berlin
[download]

Well, 'sunny' is just wishful thinking...

Update: ~~removed~~, added

Comment on Re: (german) region code detection - request for thoughts Select or Download Code

Replies are listed 'Best First'.
Re^2: (german) region code detection - request for thoughts by Skeeve (Parson) on Aug 20, 2008 at 06:58 UTC
The tree is what I already have in my code. The hash is something I didn't want. `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re^3: (german) region code detection - request for thoughts by Krambambuli (Curate) on Aug 20, 2008 at 09:54 UTC
Why would you refuse a hash ? There can't be so much regions as to not easily keep them in memory, in a simple hash like `my %prefixes = ( '04025' = [ 'Region1, Region2, Region3' ], ... );` [download] And afterwards, a straightforward check like `my ($pref5, $pref4, $pref3, $pref2) = map { substr( $phone, 0, $_ ) } (5, 4, 3, 2); my $prefix_length = exists $prefixes{$pref5} ? 5 : exists $prefixes{$pref4} ? 4 : exists $prefixes{$pref3} ? 3 : exists $prefixes{$pref2} ? 2 : 0 ; my $formatted_phone = join( ' ', substr( $phone, 0, $prefix_length), substr( $phone, $prefix_length), );` [download] should work rather very effectively. If you have thought about this already, why do you think it would be expensive/ineffective/inadequate ? Krambambuli ---	[reply] [d/l] [select]
Re^4: (german) region code detection - request for thoughts by Skeeve (Parson) on Aug 20, 2008 at 11:13 UTC
I like one regexp match more than several substring comparisons. And I didn't want a "huge" array in my code. Just one "simple" regex. My module for matching region codes and international country codes is 9K while the region codes alone are 32K. 32K is not a huge size nowadays, but I'm dated back from the ages of the C64 ;-) Your solution is quite clever but would need some enhancements to provide for Different minimal lengths of region codes Different maximum lengths of region codes It sholud find those values on it's own `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re^5: (german) region code detection - request for thoughts by Krambambuli (Curate) on Aug 20, 2008 at 11:27 UTC
Re^6: (german) region code detection - request for thoughts by Skeeve (Parson) on Aug 20, 2008 at 15:22 UTC
Re^4: (german) region code detection - request for thoughts by Skeeve (Parson) on Aug 20, 2008 at 16:20 UTC
krambambuli wrote: should work rather very effectively. I was unsure about that and so I benchmarked. I used the full list of 5132 region codes. Have no fear! There are not 32K of region codes following, just the (about) 9K of my regular expression which I use to generate the region code list and also the test data. Read more... (24 kB) This is the result: `Rate Skeeve krambambuli Skeeve 10.7/s -- -30% krambambuli 15.4/s 43% --` [download] ~~So regular expressions seem to be very efficient. The code is 30% faster.~~ ;-) It isn't. krambambuli is right. `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re^5: (german) region code detection - request for thoughts by Krambambuli (Curate) on Aug 20, 2008 at 17:16 UTC