Micz has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have a question. The company I work for has started to have quite long IDs. I hear people repeating 4-3-4-5-3-1-7-5-4-6 over the phone all the time.

I would like to write a little script which reduces a code like the above to an alphanumeric combination. E.g. 4345317546 = DA5BG. And DA5BG must only be 4345317546, so I can convert this backwards.

Is there a pre-written function to do this in Perl? I tried to program an MD5-similar function on my own, but the results were just as long and not shorter.

Thanks, Jan

Replies are listed 'Best First'.
Re: Shorter ID Codes
by kvale (Monsignor) on Oct 28, 2004 at 18:05 UTC
    First of all, you are talking about converting (I assume) a decimal number, which has 10 values per character, to a combination of digits and uppcase letters, which has 36 values per digit. So the best compression ratio you can expect is log_10(36) or about 1.56. The compression ratio of 2 you give as an example isn't possible unless the ID structure has some sparsity that you can exploit.

    So the procedure is to convert base 10 number to base 36, and then assign number and letters. Let's take 0-9 to be just 0-9 and 10-35 to be A-Z. Then the encoding step would be

    use Math::BaseCalc; $calc36 = new Math::BaseCalc(digits=>[0..9,'A'..'Z']; $calc10 = new Math::BaseCalc(digits=>[0..9]); $in_base_36 = $calc36->to_base($calc10->from_base('4345317546') ); $in_base_10 = $calc10->to_base( $calc36->from_base('DA5BG') );

    -Mark

      The sparcity could come from asserting that no id numbers lower than 4345317546 will be produced anymore so you can substract that base number and end up with really short things like '1' -> 'ZZZ....'.
Re: Shorter ID Codes
by PodMaster (Abbot) on Oct 28, 2004 at 17:54 UTC
    perldoc -f pack
    warn pack 'H*', 4345317546 ; die unpack 'H*', CE1uF;

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      I can't be the only experienced perl hacker who finds pack() and unpack() to be a complete headfuck. I can never remember which of them to use, and can never figure out exactly what template to pass to them.

      Does anyone know of a good tutorial?

Re: Shorter ID Codes
by tilly (Archbishop) on Oct 28, 2004 at 17:57 UTC
    See the thread convert numbers to a different (arbitrary) base.

    Note that you won't save that much. If you use all of the digits, lower case letters and upper case letters, then 4345317546 becomes 4K4vAS. If you just use digits and upper case letters it becomes 1ZV38H6. If you avoid vowels (to avoid various 4-letter words) then you save even less space.

Re: Shorter ID Codes
by grinder (Bishop) on Oct 28, 2004 at 18:00 UTC
    Is there a pre-written function to do this in Perl?

    Yes there is. Someone asked a similar question a week ago and I suggested that they use Math::BaseCalc.

    You just have to choose the characters you want to use, for instance 0..9 and A..Z give you 36 digit values, and so you could convert your base-10 ids into base-36. You could go to base-64 but then you'd have to say "uppercase g, lowercase r" which would be a net loss.

    The process is reversable, just convert from base 36 back to base 10.

    - another intruder with the mooring of the heat of the Perl

      Of course if you use A..Z then you may introduce confusion between 0 => O and I => 1.

      Update: Good point below, although you could mitigate that by reading it back using a phonetic alphabet (papa delta bravo tango), but again that'd take longer.

        I'm going to echo this and add that purely numbered IDs are much easier to hear. There is tremendous aural confusion around B & D & P & T etc, even among speakers with the same accent. You might find that while 4242 4242 4242 4242 takes longer to read it's a false economy to switch it out with something like PDBT.

Re: Shorter ID Codes
by ikegami (Patriarch) on Oct 28, 2004 at 19:08 UTC
    Is shortening the number worth the hassle of dealing with similiar characters (one vs lowercase L, zero vs uppercase o, etc) and similar sounding characters ('v' vs 'b', 'm' vs 'n', etc).

    btw, here's the savings you'd get:

    1 base10 digit = 1 base64 digit = 0% saving
    2 base10 digits = 2 base64 digits = 0% saving
    3 base10 digits = 2 base64 digits = 25% saving
    4 base10 digits = 3 base64 digits = 25% saving
    5 base10 digits = 3 base64 digits = 40% saving
    6 base10 digits = 4 base64 digits = 33% saving
    7 base10 digits = 4 base64 digits = 43% saving
    8 base10 digits = 5 base64 digits = 38% saving
    9 base10 digits = 5 base64 digits = 44% saving
    10 base10 digits = 6 base64 digits = 40% saving
    11 base10 digits = 7 base64 digits = 36% saving
    12 base10 digits = 7 base64 digits = 42% saving

    And since you mentioned MD5... Hashing algorithms (e.g. MD5) are lossful. I suspect you need to be able to retrieve the unencoded digits, so hashing algorithms won't help you here.

Re: Shorter ID Codes
by BrowserUk (Patriarch) on Oct 28, 2004 at 18:07 UTC

    Converting to base 64 will give you short, readable, reversible values.

    use Math::BaseCnv; print cnv( 4345317546, 10, 64 ); 4304Yg print cnv( '4304Yg', 64, 10 ); 4345317546

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      4304Yg isn't vocally readable. Here's how the conversation would go:

      4 3 oh... no, zero. Yes, 4 3 zero 4 why gee. ...oh, that's upper case Y, and lower case g.


      Dave

Re: Shorter ID Codes
by Eimi Metamorphoumai (Deacon) on Oct 28, 2004 at 18:19 UTC
    Another approach might be not to minimize for number of characters, but for syllables. For instance, "VX" is two syllables, while "vex" is only one. So if you're looking for easy ways to produce readable IDs, you could look at something like Digest::BubbleBabble (which isn't meant to be reversable, but might be a starting point).
      As an example, I've written a bit of code here. The result is almost always longer in actual characters, but considerably shorter in syllables. I've restricted the consonants to try to stay with pronouncable monosyllables. Basically, each word is a combination of a consonant cluster from $consonants1, a vowel from $vowels, and a final cluster from $consonants2. You can add anything you want to those, as long as they stay aurally distinct. Or if you find some of them are hard to pronounce or keep distinct, you could remove any that you have trouble saying. As given, it'll convert "4345317546" to "stut-prip-bluv" and back.
      #!/usr/bin/perl -l use strict; use warnings; my @consonants1 = qw/ b bl br ch d dr f fr g gr h j k kl kr l m n p pl pr r s sk skl skr sl sn sp spl spr st str t tr v vr z /; my @vowels = qw/ a e i o u /; my @consonants2 = qw/ b bs d dge ds f g gs ck ll lf m n nd nt p r s st t v x z /; my $nwords = 0; my @num2word; my %word2num; for my $c1 (@consonants1){ for my $v (@vowels){ for my $c2 (@consonants2){ $num2word[$nwords] = "$c1$v$c2"; $word2num{"$c1$v$c2"} = $nwords; $nwords++; } } } while(<>){ chomp; if (/^\d+$/){ my @words; #encode while ($_ != 0){ push @words, $num2word[$_ % $nwords]; $_ = int ($_ / $nwords); } print join "-", @words; } else { #decode my $num = 0; for my $word (reverse split /-/){ $num *= $nwords; $num += $word2num{$word}; } print $num; } }
      And just one more thought, now that I've had a bit longer to think about it. Instead of generating "words" like that, you might just get a dictionary of, say, the 10000 most frequently used English words and map each one to a 4 digit number. That way you'd be dealing with real words instead of made up words, which might be easier to understand (although you'd probably want to remove any homophones from the list, as well as anything that sounds substantially similar). Then break your number into four-digit substrings (or however is natural, if it's a phone number break it into 3-3-4, for instance) and translate each one. Again, makes the result longer, but easier to remember and communicate.