in reply to (german) region code detection - request for thoughts

Hi, what you need is the longest prefix match from a list of variable length strings. One possible approach would be to create a hash of hashes of hashes... where at each level one digit of the given number is a key identifies an edge (of a labeled 10-ary tree?).
Another idea would be to look-up the prefixes from a simple hash starting with the longest substring from the phone number to check.

Something like this (too tired to clean it up, but it shall convey an idea):

#!/bin/perl use strict; use warnings; my %tab; foreach (<DATA>) { chomp; $tab{$1}=$2 if /^\s*0(\d+)\s+(.*)/; # e.g. 30 -> Berlin # you can compute min/max of the prefix here (s. below) } # already normalised w/o leading 0 NUMBER: foreach my $num ( qw(204112345 3304123456 3145666 301234567) ) + { for (my $len=5; $len>=2; $len--) { # assumes max. prefix lenght is 5, min. is 2 my $prefix = substr($num,0,$len); if (defined (my $town=$tab{$prefix})) { printf "0%-12s = %5s-%-7s in sunny %s\n", $num, "0$prefix", substr($num,$len), $town; next NUMBER; } } print "NO MATCH FOR: 0$num\n"; } __DATA__ 0201 Essen, Ruhr 0202 Wuppertal 0203 Duisburg 02041 Bottrop 02043 Gladbeck Westf 0208 Muelheim a.d. Ruhr 0208 Oberhausen Rheinl 212 Solingen 02129 Haan Rheinl 030 Berlin 03301 Oranienburg 03302 Hennigsdorf 03303 Birkenwerder 03304 Velten 033051 Nassenheide 033053 Zehlendorf Kr Oberhav 033054 Liebenwalde 033055 Kremmen 033056 Muehlenbeck Kr Oberhav 03306 Gransee 03307 Zehdenick

This prints:

0204112345 = 02041-12345 in sunny Bottrop 03304123456 = 03304-123456 in sunny Velten NO MATCH FOR: 03145666 0301234567 = 030-1234567 in sunny Berlin
Well, 'sunny' is just wishful thinking...

Update: removed, added

Replies are listed 'Best First'.
Re^2: (german) region code detection - request for thoughts
by Skeeve (Parson) on Aug 20, 2008 at 06:58 UTC

    The tree is what I already have in my code.

    The hash is something I didn't want.


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      Why would you refuse a hash ?

      There can't be so much regions as to not easily keep them in memory, in a simple hash like

      my %prefixes = ( '04025' = [ 'Region1, Region2, Region3' ], ... );

      And afterwards, a straightforward check like
      my ($pref5, $pref4, $pref3, $pref2) = map { substr( $phone, 0, $_ ) } (5, 4, 3, 2); my $prefix_length = exists $prefixes{$pref5} ? 5 : exists $prefixes{$pref4} ? 4 : exists $prefixes{$pref3} ? 3 : exists $prefixes{$pref2} ? 2 : 0 ; my $formatted_phone = join( ' ', substr( $phone, 0, $prefix_length), substr( $phone, $prefix_length), );
      should work rather very effectively. If you have thought about this already, why do you think it would be expensive/ineffective/inadequate ?

      Krambambuli
      ---

        I like one regexp match more than several substring comparisons. And I didn't want a "huge" array in my code. Just one "simple" regex. My module for matching region codes and international country codes is 9K while the region codes alone are 32K.

        32K is not a huge size nowadays, but I'm dated back from the ages of the C64 ;-)

        Your solution is quite clever but would need some enhancements to provide for

        1. Different minimal lengths of region codes
        2. Different maximum lengths of region codes
        3. It sholud find those values on it's own

        s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
        +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
        krambambuli wrote:
        should work rather very effectively.

        I was unsure about that and so I benchmarked.

        I used the full list of 5132 region codes. Have no fear! There are not 32K of region codes following, just the (about) 9K of my regular expression which I use to generate the region code list and also the test data.

        This is the result:
        Rate Skeeve krambambuli Skeeve 10.7/s -- -30% krambambuli 15.4/s 43% --

        So regular expressions seem to be very efficient. The code is 30% faster. ;-) It isn't. krambambuli is right.


        s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
        +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e