How can I create unique integers to represent each string?
It's easy: the character representation of a string is already a unique numeric representation of that string.
C:\test>perl -nle"printf qq[%-20s : %u\n], $_, unpack 'Q', pack 'Z8', $_" words.txt
aa : 24929
aah : 6840673
aahed : 431198069089
aahing : 113723912511841
aahs : 1936220513
aal : 7102817
aalii : 452740276577
aaliis : 126896577470817
aals : 1936482657
aardvark : 32195308464267617
aardvarks : 32195308464267617
aardwolf : 30521856061759841
aardwolves : 30521856061759841
aargh : 448412148065
aarrgh : 114793511018849
aarrghh : 29388191088927073
aas : 7561569
aasvogel : 28542701074080097
aasvogels : 28542701074080097
ab : 25185
aba : 6382177
abaca : 418279154273
abacas : 126862116348513
abaci : 452638892641
aback : 461228827233
abacterial : 32199697902953057
abacus : 126948015694433
abacuses : 28555920663470689
abaft : 499933864545
abaka : 418413372001
abakas : 126862250566241
abalone : 28550397486522977
abalones : 28550397486522977
...
Of course, the problem is that, using the largest integers easily available to us -- 64-bit -- we can only accommodate the first 8 bytes of the string. Beyond that, the extra characters no longer fit into the 64-bit representation, and so words like aardvark and aardvarks become indistinguishable.
There are modules on CPAN that allow us to use larger integers -- salva has 128- and 256-bit integer modules -- which would double or quadruple the length of string we can uniquely accommodate, but then the second problem becomes obvious: the numeric representations take up just as much space as the strings they represent, so nothing is gained.
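For illustration, here's a minimal sketch of the same idea using the core Math::BigInt module (rather than salva's XS modules): every byte of the string contributes to one arbitrarily large integer, so aardvark and aardvarks become distinct again -- but the resulting number is no shorter than the string that produced it.

use strict;
use warnings;
use Math::BigInt;

for my $word ( qw( aardvark aardvarks aardwolf aardwolves ) ) {
    ## Treat the string's bytes as one big hex number of arbitrary size.
    my $n = Math::BigInt->from_hex( unpack 'H*', $word );
    printf "%-12s : %s\n", $word, $n->bstr;
}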
Another possibility arises if the alphabet used for the strings is smaller than the character code representation it uses. I.e. if the strings only use (say) lower-case alpha characters and digits, then each of those 36 characters can be represented in 6 bits, so it becomes possible to pack the characters more tightly and have a 64-bit integer represent 10-byte strings rather than 8-byte strings; but as you can see, the gains are minimal.
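A rough sketch of that 6-bit packing, assuming a 64-bit Perl and an alphabet of exactly a-z plus 0-9:

use strict;
use warnings;

## Map a-z0-9 to the 6-bit codes 1..36; 0 is reserved for 'no character'.
my @alphabet = ( 'a' .. 'z', '0' .. '9' );
my %code;
$code{ $alphabet[ $_ ] } = $_ + 1 for 0 .. $#alphabet;

sub str_to_u64 {
    my( $s ) = @_;
    die "More than 10 characters won't fit in 64 bits\n" if length( $s ) > 10;
    my $n = 0;
    ## Shift in 6 bits per character; 10 x 6 = 60 bits, within a 64-bit int.
    $n = ( $n << 6 ) | $code{ $_ } for split //, lc $s;
    return $n;
}

printf "%-12s : %u\n", $_, str_to_u64( $_ ) for qw( aardvark aardvarks aardwolves );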
And finally, there is the possibility of using a hashing algorithm to create a digest of the string. For example, the MD5 digest algorithm will 'reduce' strings of any length to a single, 128-bit number.
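For example, with the core Digest::MD5 module (the digest shown as 32 hex digits = 128 bits):

use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

printf "%-12s : %s\n", $_, md5_hex( $_ ) for qw( aardvark aardvarks aardwolves );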
But, just as the pack-based integerisation shown above maps multiple strings to each number, the same is true of every digest/hashing algorithm in existence. So any algorithm or logic built around using numerical digests to discover or ensure string uniqueness must be designed to accommodate the possibility of duplicate digests from disparate inputs.
Duplicates are a simple fact of life for all digest algorithms.
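Which means any digest-keyed lookup has to keep -- and compare -- the original strings as well. A minimal sketch, using the MD5 digest as the hash key and a bucket of the strings that produced it, so a duplicate digest from a different input is detected rather than silently conflated:

use strict;
use warnings;
use Digest::MD5 qw( md5 );

my %seen;   ## digest => array of the distinct strings that produced it

sub is_new {
    my( $string ) = @_;
    my $bucket = $seen{ md5( $string ) } //= [];
    ## Only a match on the actual string counts as 'seen before';
    ## a matching digest alone could be a collision.
    return 0 if grep { $_ eq $string } @$bucket;
    push @$bucket, $string;
    return 1;
}

print "$_ : ", is_new( $_ ) ? 'new' : 'dup', "\n"
    for qw( aardvark aardvarks aardvark );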
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.