Re: Shorter ID Codes
by kvale (Monsignor) on Oct 28, 2004 at 18:05 UTC
|
First of all, you are talking about converting (I assume) a decimal number, which has 10 values per character, to a combination of digits and uppcase letters, which has 36 values per digit. So the best compression ratio you can expect is log_10(36) or about 1.56.
The compression ratio of 2 you give as an example isn't possible unless the ID structure has some sparsity that you can exploit.
So the procedure is to convert base 10 number to base 36, and then assign number and letters. Let's take 0-9 to be just 0-9 and 10-35 to be A-Z. Then the encoding step would be
use Math::BaseCalc;
$calc36 = new Math::BaseCalc(digits=>[0..9,'A'..'Z'];
$calc10 = new Math::BaseCalc(digits=>[0..9]);
$in_base_36 = $calc36->to_base($calc10->from_base('4345317546') );
$in_base_10 = $calc10->to_base( $calc36->from_base('DA5BG') );
| [reply] [d/l] |
|
|
The sparcity could come from asserting that no id numbers lower than 4345317546 will be produced anymore so you can substract that base number and end up with really short things like '1' -> 'ZZZ....'.
| [reply] |
Re: Shorter ID Codes
by PodMaster (Abbot) on Oct 28, 2004 at 17:54 UTC
|
warn pack 'H*', 4345317546 ;
die unpack 'H*', CE1uF;
| MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!" | | I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README). | | ** The third rule of perl club is a statement of fact: pod is sexy. |
| [reply] [d/l] |
|
|
| [reply] |
|
|
| MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!" | | I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README). | | ** The third rule of perl club is a statement of fact: pod is sexy. |
| [reply] |
Re: Shorter ID Codes
by tilly (Archbishop) on Oct 28, 2004 at 17:57 UTC
|
See the thread convert numbers to a different (arbitrary) base.
Note that you won't save that much. If you use all of the digits, lower case letters and upper case letters, then 4345317546 becomes 4K4vAS. If you just use digits and upper case letters it becomes 1ZV38H6. If you avoid vowels (to avoid various 4-letter words) then you save even less space. | [reply] |
Re: Shorter ID Codes
by grinder (Bishop) on Oct 28, 2004 at 18:00 UTC
|
Is there a pre-written function to do this in Perl?
Yes there is. Someone asked a similar question a week ago and I suggested that they use Math::BaseCalc.
You just have to choose the characters you want to use, for instance 0..9 and A..Z give you 36 digit values, and so you could convert your base-10 ids into base-36. You could go to base-64 but then you'd have to say "uppercase g, lowercase r" which would be a net loss.
The process is reversable, just convert from base 36 back to base 10.
- another intruder with the mooring of the heat of the Perl
| [reply] |
|
|
Of course if you use A..Z then you may introduce confusion between 0 => O and I => 1.
Update: Good point below, although you could mitigate that by reading it back using a phonetic alphabet (papa delta bravo tango), but again that'd take longer.
| [reply] |
|
|
I'm going to echo this and add that purely numbered IDs are much easier to hear. There is tremendous aural confusion around B & D & P & T etc, even among speakers with the same accent. You might find that while 4242 4242 4242 4242 takes longer to read it's a false economy to switch it out with something like PDBT.
| [reply] |
Re: Shorter ID Codes
by ikegami (Patriarch) on Oct 28, 2004 at 19:08 UTC
|
Is shortening the number worth the hassle of dealing with similiar characters (one vs lowercase L, zero vs uppercase o, etc) and similar sounding characters ('v' vs 'b', 'm' vs 'n', etc).
btw, here's the savings you'd get:
1 base10 digit = 1 base64 digit = 0% saving
2 base10 digits = 2 base64 digits = 0% saving
3 base10 digits = 2 base64 digits = 25% saving
4 base10 digits = 3 base64 digits = 25% saving
5 base10 digits = 3 base64 digits = 40% saving
6 base10 digits = 4 base64 digits = 33% saving
7 base10 digits = 4 base64 digits = 43% saving
8 base10 digits = 5 base64 digits = 38% saving
9 base10 digits = 5 base64 digits = 44% saving
10 base10 digits = 6 base64 digits = 40% saving
11 base10 digits = 7 base64 digits = 36% saving
12 base10 digits = 7 base64 digits = 42% saving
And since you mentioned MD5... Hashing algorithms (e.g. MD5) are lossful. I suspect you need to be able to retrieve the unencoded digits, so hashing algorithms won't help you here.
| [reply] |
Re: Shorter ID Codes
by BrowserUk (Patriarch) on Oct 28, 2004 at 18:07 UTC
|
Converting to base 64 will give you short, readable, reversible values.
use Math::BaseCnv;
print cnv( 4345317546, 10, 64 );
4304Yg
print cnv( '4304Yg', 64, 10 );
4345317546
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
| [reply] [d/l] |
|
|
| [reply] |
Re: Shorter ID Codes
by Eimi Metamorphoumai (Deacon) on Oct 28, 2004 at 18:19 UTC
|
Another approach might be not to minimize for number of characters, but for syllables. For instance, "VX" is two syllables, while "vex" is only one. So if you're looking for easy ways to produce readable IDs, you could look at something like Digest::BubbleBabble (which isn't meant to be reversable, but might be a starting point). | [reply] |
|
|
As an example, I've written a bit of code here. The result is almost always longer in actual characters, but considerably shorter in syllables. I've restricted the consonants to try to stay with pronouncable monosyllables. Basically, each word is a combination of a consonant cluster from $consonants1, a vowel from $vowels, and a final cluster from $consonants2. You can add anything you want to those, as long as they stay aurally distinct. Or if you find some of them are hard to pronounce or keep distinct, you could remove any that you have trouble saying. As given, it'll convert "4345317546" to "stut-prip-bluv" and back.
#!/usr/bin/perl -l
use strict;
use warnings;
my @consonants1 = qw/ b bl br ch d dr f fr g gr h j k kl kr l m n
p pl pr r s sk skl skr sl sn sp spl spr st
str t tr v vr z /;
my @vowels = qw/ a e i o u /;
my @consonants2 = qw/ b bs d dge ds f g gs ck ll lf m n nd nt p r
s st t v x z /;
my $nwords = 0;
my @num2word;
my %word2num;
for my $c1 (@consonants1){
for my $v (@vowels){
for my $c2 (@consonants2){
$num2word[$nwords] = "$c1$v$c2";
$word2num{"$c1$v$c2"} = $nwords;
$nwords++;
}
}
}
while(<>){
chomp;
if (/^\d+$/){
my @words;
#encode
while ($_ != 0){
push @words, $num2word[$_ % $nwords];
$_ = int ($_ / $nwords);
}
print join "-", @words;
} else {
#decode
my $num = 0;
for my $word (reverse split /-/){
$num *= $nwords;
$num += $word2num{$word};
}
print $num;
}
}
| [reply] [d/l] [select] |
|
|
And just one more thought, now that I've had a bit longer to think about it. Instead of generating "words" like that, you might just get a dictionary of, say, the 10000 most frequently used English words and map each one to a 4 digit number. That way you'd be dealing with real words instead of made up words, which might be easier to understand (although you'd probably want to remove any homophones from the list, as well as anything that sounds substantially similar). Then break your number into four-digit substrings (or however is natural, if it's a phone number break it into 3-3-4, for instance) and translate each one. Again, makes the result longer, but easier to remember and communicate.
| [reply] |