How can I create unique integers to represent each string?
It's easy: the character representation of a string is already a unique numeric representation of that string.
C:\test>perl -nle"printf qq[%-20s : %u\n], $_, unpack 'Q', pack 'Z8', $_" words.txt
aa : 24929
aah : 6840673
aahed : 431198069089
aahing : 113723912511841
aahs : 1936220513
aal : 7102817
aalii : 452740276577
aaliis : 126896577470817
aals : 1936482657
aardvark : 32195308464267617
aardvarks : 32195308464267617
aardwolf : 30521856061759841
aardwolves : 30521856061759841
aargh : 448412148065
aarrgh : 114793511018849
aarrghh : 29388191088927073
aas : 7561569
aasvogel : 28542701074080097
aasvogels : 28542701074080097
ab : 25185
aba : 6382177
abaca : 418279154273
abacas : 126862116348513
abaci : 452638892641
aback : 461228827233
abacterial : 32199697902953057
abacus : 126948015694433
abacuses : 28555920663470689
abaft : 499933864545
abaka : 418413372001
abakas : 126862250566241
abalone : 28550397486522977
abalones : 28550397486522977
...
Of course, the problem is that, using the largest integers easily available to us -- 64-bit -- we can only accommodate the first 8 bytes of the string. Beyond that, the extra characters no longer fit into the 64-bit representation, and so words like aardvark and aardvarks become indistinguishable.
There are modules on CPAN that allow us to use larger integers -- salva has 128- and 256-bit integer modules -- which would double or quadruple the length of string we can uniquely accommodate, but then the second problem becomes obvious: the numeric representations take up just as much space as the strings they represent, so nothing is gained.
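For illustration, here's a minimal sketch of the same idea using the core Math::BigInt module (rather than salva's XS modules): every byte of the string contributes to one arbitrarily large integer, so aardvark and aardvarks become distinct again -- but the resulting number is no shorter than the string that produced it.

use strict;
use warnings;
use Math::BigInt;

for my $word ( qw( aardvark aardvarks aardwolf aardwolves ) ) {
    ## Treat the string's bytes as one big hex number of arbitrary size.
    my $n = Math::BigInt->from_hex( unpack 'H*', $word );
    printf "%-12s : %s\n", $word, $n->bstr;
}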
Another possibility arises if the alphabet used for the strings is smaller than the character code representation it uses. I.e. if the strings only use (say) lower-case alpha characters and digits, then each of those 36 characters can be represented in 6 bits, so it becomes possible to pack the characters more tightly and have a 64-bit integer represent 10-byte strings rather than 8-byte strings; but as you can see, the gains are minimal.
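A rough sketch of that 6-bit packing, assuming a 64-bit Perl and an alphabet of exactly a-z plus 0-9:

use strict;
use warnings;

## Map a-z0-9 to the 6-bit codes 1..36; 0 is reserved for 'no character'.
my @alphabet = ( 'a' .. 'z', '0' .. '9' );
my %code;
$code{ $alphabet[ $_ ] } = $_ + 1 for 0 .. $#alphabet;

sub str_to_u64 {
    my( $s ) = @_;
    die "More than 10 characters won't fit in 64 bits\n" if length( $s ) > 10;
    my $n = 0;
    ## Shift in 6 bits per character; 10 x 6 = 60 bits, within a 64-bit int.
    $n = ( $n << 6 ) | $code{ $_ } for split //, lc $s;
    return $n;
}

printf "%-12s : %u\n", $_, str_to_u64( $_ ) for qw( aardvark aardvarks aardwolves );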
And finally, there is the possibility of using a hashing algorithm to create a digest of the string. For example, the MD5 digest algorithm will 'reduce' strings of any length to a single, 128-bit number.
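For example, with the core Digest::MD5 module (the digest shown as 32 hex digits = 128 bits):

use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

printf "%-12s : %s\n", $_, md5_hex( $_ ) for qw( aardvark aardvarks aardwolves );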
But, just as the pack-based integerisation shown above maps multiple strings to each number, the same is true of every digest/hashing algorithm in existence. So any algorithm or logic built around using numerical digests to discover or ensure string uniqueness must be designed to accommodate the possibility of duplicate digests from disparate inputs.
Duplicates are a simple fact of life for all digest algorithms.
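Which means any digest-keyed lookup has to keep -- and compare -- the original strings as well. A minimal sketch, using the MD5 digest as the hash key and a bucket of the strings that produced it, so a duplicate digest from a different input is detected rather than silently conflated:

use strict;
use warnings;
use Digest::MD5 qw( md5 );

my %seen;   ## digest => array of the distinct strings that produced it

sub is_new {
    my( $string ) = @_;
    my $bucket = $seen{ md5( $string ) } //= [];
    ## Only a match on the actual string counts as 'seen before';
    ## a matching digest alone could be a collision.
    return 0 if grep { $_ eq $string } @$bucket;
    push @$bucket, $string;
    return 1;
}

print "$_ : ", is_new( $_ ) ? 'new' : 'dup', "\n"
    for qw( aardvark aardvarks aardvark );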
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.