Re: Algorithm for converting string to number?
by Old_Gray_Bear (Bishop) on Aug 23, 2011 at 09:09 UTC
If this were my problem, I'd be reaching for a database immediately. My first column is the key and consists of your character string; the second column is a monotonically increasing INT of the longest precision I can get.
Almost any other 'algorithmic-mapping' solution is going to give you fits down the line when your mapping scheme encounters strings that generate integers that won't fit into your remote system's idea of INT. (Now if this is just a one-off to help get converted away from that silly system, go ahead with the algorithmic mapping, it's going to be less resource-intensive.)
But if it were me, I'd still go with the database (SQLite probably). One-offs have a nasty habit of hanging around for a long time....
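A minimal sketch of that two-column table, assuming DBI and DBD::SQLite are installed. The table name code_map, the column names and the helper id_for_code are made up for illustration, and the in-memory database is only there to keep the example self-contained; a real mapping would point at a file so it persists.

```perl
use strict;
use warnings;
use DBI;

# ":memory:" keeps this sketch self-contained; use a file path to persist.
my $dbh = DBI->connect("dbi:SQLite:dbname=:memory:", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

# String is the unique key; the integer is handed out by SQLite.
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS code_map (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        code TEXT NOT NULL UNIQUE
    )
});

# Return the integer for $code, inserting a new row the first time it is seen.
sub id_for_code {
    my ($dbh, $code) = @_;
    $dbh->do("INSERT OR IGNORE INTO code_map (code) VALUES (?)", undef, $code);
    my ($id) = $dbh->selectrow_array(
        "SELECT id FROM code_map WHERE code = ?", undef, $code);
    return $id;
}

print id_for_code($dbh, "AK47"), "\n";
print id_for_code($dbh, "Avtomat Kalashnikova 47"), "\n";
print id_for_code($dbh, "AK47"), "\n";    # same id as the first call
```

The integers stay small no matter how long the strings get, which is the point of the lookup approach.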
Update -- corrected first word =~ /I/If/
----
I Go Back to Sleep, Now.
OGB
Couldn't you do the same thing with just a regular Perl hash? (with the string being the key, and a global big int incremented as a unique value?)
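For comparison, a sketch of the plain-hash version. The names %id_for, $next_id and id_for are illustrative only. It behaves the same within one run, but the mapping is gone when the program exits unless you store it somewhere yourself.

```perl
use strict;
use warnings;

my %id_for;         # string => integer
my $next_id = 1;    # global counter, bumped for every new string

# Return the integer for $string, assigning the next free one on first sight.
sub id_for {
    my ($string) = @_;
    $id_for{$string} //= $next_id++;
    return $id_for{$string};
}

print id_for("AK47"), "\n";      # 1
print id_for("widget"), "\n";    # 2
print id_for("AK47"), "\n";      # 1
```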
Re: Algorithm for converting string to number?
by moritz (Cardinal) on Aug 23, 2011 at 08:29 UTC
A computer internally stores all data in bytes, so you could just interpret those bytes as an integer.
unpack (see also: pack and perlpacktut) can help you here:
$ perl -wE 'say unpack "q", "abcdefgh"'
7523094288207667809
But note that this leads to rather large integers: 8 characters of text (with codepoints up to 255) exactly fill a 64-bit int (use "Q" rather than "q" if you want it unsigned).
If the strings are larger than that, it might be necessary to create a persistent lookup table, where you assign integer values to string labels, and look them up in there.
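A sketch of the unpack approach for strings of up to 8 bytes, with the reverse mapping. The function names are invented here, and it assumes a perl built with 64-bit integer support. It uses "Q<" (unsigned, explicitly little-endian) so the result does not depend on the machine's byte order.

```perl
use strict;
use warnings;

# Interpret up to 8 single-byte characters as one unsigned 64-bit integer.
sub str_to_int {
    my ($s) = @_;
    die "need 1..8 characters with codepoints up to 255\n"
        if length($s) < 1 || length($s) > 8 || $s =~ /[^\x00-\xff]/;
    # "a8" pads with NUL bytes up to 8 bytes; "Q<" reads them little-endian.
    return unpack("Q<", pack("a8", $s));
}

# Reverse mapping: integer back to the original string.
sub int_to_str {
    my ($n) = @_;
    my $s = pack("Q<", $n);
    $s =~ s/\0+\z//;    # drop the NUL padding
    return $s;
}

print str_to_int("abcdefgh"), "\n";    # 7523094288207667809
print int_to_str(str_to_int("AK47")), "\n";    # AK47
```

Strings that themselves end in NUL bytes would not survive the round trip, since the padding is indistinguishable from data.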
Re: Algorithm for converting string to number?
by DrHyde (Prior) on Aug 23, 2011 at 09:50 UTC
You could naively convert all characters to ints using ord() and then concatenate them together. Don't forget to pad with zeroes to avoid problems with, eg, 654 representing either character 6 followed by character 5 followed by character 4, or character 65 followed by character 4, or character 6 followed by character 54.
For all but trivial strings this will lead to very long numbers.
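A sketch of the zero-padded version, assuming codepoints up to 255 so that three decimal digits per character are enough. The name ord_concat is made up. The result is returned as a string of digits, because it quickly outgrows a native integer.

```perl
use strict;
use warnings;

# Each character becomes exactly three decimal digits (000..255),
# so the result can be split back into characters unambiguously.
sub ord_concat {
    my ($s) = @_;
    die "characters above codepoint 255 need wider padding\n"
        if $s =~ /[^\x00-\xff]/;
    return join "", map { sprintf "%03d", ord($_) } split //, $s;
}

print ord_concat("AK47"), "\n";    # 065075052055
```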
So as an alternative you could treat your string as being a number in base N and simply convert it to decimal. eg, if the valid characters in your input are case-insensitive ASCII letters, digits and spaces, that's 37 "digits", so a number in base 37. Math::NumberBase is your friend here. This could still lead to very big numbers though. "AK47" is quite a small number in that system, but "Avtomat Kalashnikova 47" is a very big number.
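A hand-rolled sketch of the base-37 idea using the core module Math::BigInt rather than Math::NumberBase. The digit order (0-9, then A-Z, then space) and the name base37_to_int are assumptions made for the example.

```perl
use strict;
use warnings;
use Math::BigInt;

my @digits = ('0' .. '9', 'A' .. 'Z', ' ');    # assumed order, 37 symbols
my %value_of;
@value_of{@digits} = (0 .. $#digits);

# Read the (case-insensitive) string as a base-37 number.
sub base37_to_int {
    my ($s) = @_;
    my $n = Math::BigInt->new(0);
    for my $ch (split //, uc $s) {
        die "invalid character '$ch'\n" unless exists $value_of{$ch};
        $n = $n * 37 + $value_of{$ch};
    }
    return $n;
}

print base37_to_int("AK47"), "\n";    # 534065
print base37_to_int("Avtomat Kalashnikova 47"), "\n";    # far beyond 64 bits
```

As with any positional notation, leading "zero" symbols are lost: with this digit order "0AK47" and "AK47" give the same number.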
Finally, if you're prepared to accept a *tiny* risk of two strings mapping to the same number, use a hash function. MD5 will give you a 128 bit number. If your set of inputs is small (of the order of a few thousand) then you can just take the first 32 bits of that and still generally avoid collisions, or if you have a few million inputs, take the first 64 bits.
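A sketch of the truncated-digest idea with the core module Digest::MD5. The function names are made up, "first N bits" is read here as big-endian, and the 64-bit variant assumes a perl with 64-bit integers.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# First 32 bits of the MD5 digest as an unsigned integer.
sub md5_32 {
    my ($s) = @_;
    return unpack("N", md5($s));
}

# First 64 bits of the MD5 digest as an unsigned integer.
sub md5_64 {
    my ($s) = @_;
    return unpack("Q>", md5($s));
}

printf "%08x\n", md5_32("abc");    # 90015098
print md5_64("abc"), "\n";
```

md5() works on bytes, so strings containing characters above codepoint 255 would need encoding (e.g. to UTF-8) first.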
Re: Algorithm for converting string to number?
by JavaFan (Canon) on Aug 23, 2011 at 10:12 UTC
Well, yes, and no.
Here's the no answer:
Integers are typically 32 or 64 bits wide, giving you the ability to uniquely map 2**32 or 2**64 strings. That is, if you limit yourself to Latin-1 strings, strings up to 4 or 8 characters long. A few characters more if your strings contain only letters and digits. Fewer characters if your strings use the full Unicode range.
Here's the yes answer:
Use a database table, with two columns: a string field, and an integer field (autoincrementing). Use this to map the strings to (unique) integers.
Re: Algorithm for converting string to number?
by choroba (Cardinal) on Aug 23, 2011 at 08:04 UTC
See ord. For strings of more than one character, you might need split and map, too.
Re: Algorithm for converting string to number?
by BrowserUk (Patriarch) on Aug 23, 2011 at 11:20 UTC
What are the chances of two or more of your textual product codes clashing if you transformed them by:
- converting all letters to uppercase ASCII (e.g. A-Z);
- stripping out all spaces and punctuation;
- truncating them to 12 characters;
If the chances of a collision are low -- and I think they would probably be so low as to be considered negligible -- then you can have a mapping function to a 64-bit int that will cover 36**12 (4,738,381,338,321,616,896) product ids, which is probably enough to be going on with. (13 characters would give 36**13 = 170,581,728,179,578,208,256, which is more than 2**64 and so no longer fits.)
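A sketch of such a mapping function, assuming a perl with 64-bit integers. The name code_to_int is made up. It truncates to 12 characters because 36**12 is the largest power of 36 that still fits in a 64-bit integer.

```perl
use strict;
use warnings;

# Normalise the code, then read what is left as a base-36 number.
sub code_to_int {
    my ($code) = @_;
    my $s = uc $code;
    $s =~ s/[^0-9A-Z]//g;     # strip spaces, punctuation and anything else
    $s = substr $s, 0, 12;    # keep at most 12 characters
    my $n = 0;
    for my $ch (split //, $s) {
        my $digit = $ch =~ /[0-9]/ ? $ch : ord($ch) - ord('A') + 10;
        $n = $n * 36 + $digit;
    }
    return $n;
}

print code_to_int("AK-47"), "\n";           # 492631
print code_to_int("zzzz zzzz zzzz"), "\n";  # 4738381338321616895
```

Besides the clashes caused by stripping and truncating, codes that differ only in leading zeros ("0AK47" and "AK47") also map to the same number.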
Some will say that relying on 'negligible risk' is still "too dangerous", completely failing to recognise that every digest, UUID/GUID etc. relies on exactly that principle.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Algorithm for converting string to number?
by GrandFather (Saint) on Aug 23, 2011 at 11:37 UTC
Tell us more. What you are describing so far is hashing, but to guarantee unique hash values you need to know a lot about the range of possible input values and the allowed size of the output hash (number of bits in the int).
True laziness is hard work
Re: Algorithm for converting string to number?
by Your Mother (Archbishop) on Aug 23, 2011 at 13:58 UTC
I do not think this is a good idea—you got some already—but it does fulfill the letter of the request. :P
use Math::BigInt;
use Digest::SHA "sha1_hex";
my $string = shift || die "Give a string!\n";
print Math::BigInt->new( "0x" . sha1_hex($string) ), $/;
__END__
"mabye a db would be better"
-> 1118343603570750339537815681476550431532447928026
Re: Algorithm for converting string to number?
by locked_user sundialsvc4 (Abbot) on Aug 23, 2011 at 14:53 UTC
While it is possible to “hash” any string value to produce an integer ... I believe that, by doing so, you would be violating one of the tenets of “normal form,” namely, that identifying-numbers ought not contain any embedded information.
Of course, at least until UPC-codes came along, product identifiers usually did contain a lot of embedded information. (Vehicle Identification Numbers, for example, of course still do.) But these string-values that you speak of are really attributes of the product itself. They don’t belong in a product-code. If you have any say in the matter, counsel against it.
A thing that is good to design into a product code is a check-digit of some kind. The ISBN codes that are used in the publishing industry, as well as credit-card numbers, both include such check digits. But the check digit does not convey any information. Its only purpose is to immediately detect bogus scans.
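As one concrete example of a check digit, here is a sketch of the Luhn scheme, which is the one credit-card numbers use (ISBNs use a different weighted-sum scheme). The function name is made up; the payload is the number without its check digit.

```perl
use strict;
use warnings;

# Compute the Luhn check digit to append to a string of decimal digits.
sub luhn_check_digit {
    my ($payload) = @_;
    die "digits only\n" unless $payload =~ /\A[0-9]+\z/;
    my $sum    = 0;
    my $double = 1;    # the rightmost payload digit is doubled
    for my $d (reverse split //, $payload) {
        my $v = $double ? $d * 2 : $d;
        $v -= 9 if $v > 9;    # same as adding the two digits of $v
        $sum += $v;
        $double = !$double;
    }
    return (10 - $sum % 10) % 10;
}

print luhn_check_digit("7992739871"), "\n";    # 3
```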
Whether or not you send this additional “meta data” to the target system, you should keep it. A database, or a database file e.g. SQLite, is an appropriate choice.