Re: An improved technique for database primary keys
by fullermd (Vicar) on Nov 05, 2010 at 05:32 UTC
|
The choice of “8 characters’ is arbitrary, but it should simply be long enough that a “collision” is impossible.
I woke up screaming when I read that. And I wasn't even asleep.
| [reply] |
|
|
Sleep well. It is a calculated-risk, and of course the index in question must be UNIQUE so that colliding values, if they were somehow generated, would not be stored. You could, of course, go so far as to use Microsoft’s UUIDs (GUIDs).
The notion is simply that, if the string is random and sufficiently long, key collisions in-practice “ain’t never actually gonna happen.” Your mileage may vary. Choose appropriately. If, in whatever context you may be dealing with, this admonition isn’t sound, then don’t follow it. Peace.
| |
Re: An improved technique for database primary keys
by moritz (Cardinal) on Nov 05, 2010 at 08:55 UTC
|
This should be a column whose only purpose in life is to uniquely identify the row, and this strictly for the database’s own, internal purposes.
Which is why it makes me uneasy to generate the primary key for new rows outside the database. Sometimes you don't want to use Perl to talk to the database (for example using the database's command line client for administrative tasks is very handy), and it would be a pity if you couldn't insert any rows with reasonable effort without going through your Perl scripts/modules.
I don't know if the databases that you have to work with support generating such sequences, but I for one would feel uneasy about moving such a core component of database integrity out of the database, if it can be avoided by any means.
Perl 6 - links to (nearly) everything that is Perl 6.
| [reply] |
|
|
As noted, this task is performed by a Filer module within my application, which is “solely responsible for its content” (and which provides a public moniker-generating subroutine for anyone and everyone to use). It is a calculated, and I think in this case defendable, design choice. But your point is well-taken.
| |
Re: An improved technique for database primary keys
by duelafn (Parson) on Nov 05, 2010 at 05:15 UTC
|
psql> CREATE SEQUENCE foo;
psql> CREATE TABLE bar (
id char(10) PRIMARY KEY DEFAULT 'bar_' || nextval('foo')
, name text
);
psql> INSERT INTO bar (name) VALUES ('Alice');
psql> INSERT INTO bar (name) VALUES ('Bob');
psql> SELECT * FROM bar;
id | name
------------+-------
bar_1 | Alice
bar_2 | Bob
| [reply] [d/l] |
Re: An improved technique for database primary keys
by JavaFan (Canon) on Nov 05, 2010 at 10:06 UTC
|
| [reply] [d/l] |
|
|
The “literal” content of the moniker is expressly designed to make the string more sensible to humans. If the moniker simply consisted of a random string of characters, then it would effectively be “just like a large integer,” in the sense that, if you accidentally got one in the wrong place, you could never know it nor detect it.
As you correctly and astutely observe, my phrase, utterly devoid of ... other meaning, is not strictly true, at least with one particular reading of the English word, “meaning.” The random string, itself, means nothing. The prefix is to be used only as a convention.
It would be verboten to write logic that examined the value of the moniker and derived any sort of “actionable understanding” of the record’s content based on any prefix-string found there. That would be an egregious violation of normal-forms. But to make key-strings convey information that is useful for debugging and for automatic data-integrity verification ... that, I submit, is useful.
I agree with your point about page-splits. You are absolutely correct that this strategy comes with costs, and this is one of them. (If your database supports hashed, vs. B-tree, indexes, and if you elect to use them due to reasons other than just this, then that cost can be somewhat relieved.)
| |
Re: An improved technique for database primary keys
by herveus (Prior) on Nov 05, 2010 at 12:54 UTC
|
Howdy!
Every relational database table should, of course, have a primary key. This should be a column whose only purpose in life is to uniquely identify the row, and this strictly for the database’s own, internal purposes. It should be a column whose values are utterly devoid of business or other meaning.
I beg to differ. Every table should have a primary key. However, that key can (and often should)
be comprised of multiple columns. What you describe here is the use of surrogate keys.
Sometimes you have no real choice but to use a surrogate key, but they are frequently misapplied.
Even if you do use a surrogate key, there should be a set of columns whose values identify
the row. Those columns are the natural key. If any of those columns are permitted to be NULL,
they can't be a primary key. The best you can do with them is make a UNIQUE index on them.
The risk with surrogate keys (especially if the natural key is ignored) is that you will
get duplicate data. I've seen that in the wild. Use surrogate keys with care, if you must,
but remember to identify the alternate key and enforce it.
Aside from that major quibble, I do like your idea.
| [reply] |
|
|
| [reply] |
|
|
| |
|
|
You are correct: these are, indeed, “surrogate” keys.
“Constants aren’t. Variables don’t. Unique numbers aren’t (at least not if any human being has anything to do with it). Even computer-generated “non-repeating” numbers sometimes do. Plan accordingly.”
As a general design principle, I routinely elect to use surrogate keys as the record-identifiers, i.e. for table-JOIN purposes, and to use UNIQUE indexes for any of the “natural” uses that you describe.
I learned to do this when working on a very early project for an insurance company which had converted from an original paper-based provider-ID numbering system. The same provider-ID was assigned to more than one provider, and providers had more than one ID. Furthermore, business requirements made it impossible for this situation to be changed, because these numbers were widely-distributed in the field. My strategy involved the use of surrogate keys. These keys were “nonsensical character strings,” much like these, and they were never, ever divulged to the user. The application stored the provider-ID number that had been entered, which was known to be ambiguous, and it resolved that ambiguity (by various rules and by user-input) and stored the corresponding (unambiguous) surrogate-key in another column. All table linking took place using the surrogate columns. This approach did successfully solve what was up to that time a very thorny business problem.
| |
|
|
Howdy!
...and you describe exactly the kind of situation where a surrogate key becomes necessary.
I simply object to the terminology which implies that primary keys are, necessarily, opaque
values stored in a single column. Changing the title to refer to surrogate keys would
fix that.
| [reply] |
|
|
Multicolumn primary keys are ... a major pita as soon as you need to do any joins. I had to write some queries for a schema that used them and it was horrible and error prone. It was way too easy to forget one of the three columns in the sixth join. Plus of course having to repeat all those columns in all those related tables is wasteful.
If you want to ensure something is unique, use a unique index/constraint, but please do not ever even think of using a multicolumn PK!
Update: I forgot to add ... unless the table is basically just implementation of an N-N relation. With or without additional columns. In that case you are unlikely to need to join using both columns at the same time (see ... you can't assume you'll always include the whole PK in all joins) and the surrogate ID would (almost) never be used.
Jenda
Enoch was right!
Enjoy the last years of Rome.
| [reply] |
|
|
Howdy!
It's a tradeoff. If you use surrogate keys, you always have to join to get the actual key values.
Multicolumn PKs are natural. Surrogate keys obfuscate the true structure. Foreign key constraints
are essential to document the key structures and to enforce referential integrity. They also
help query building tools get the joins right.
| [reply] |
Re: An improved technique for database primary keys
by Tux (Canon) on Nov 05, 2010 at 14:23 UTC
|
Would the title, and the tone of your post, would have been "An alternate technique ..." instead of "An improved technique ...", I would have had a completely different feeling.
I really abhor your thoughts now. This goes straight into colision course with normalization. There is absoloutely no need for surrogate keys if the primary key is unique and used as such. In many many many occasions, the real key indead has a meaning, or even better, a centralized (or decentralized looking from the opposite site) location where the "base" table values can be fetched for reference. Think of the prefix for phone numbers for countries. In those cases the keys are obvious and logical. Dialing +31 will get you to the Netherlands, and that is very unlikely to change. In a table that would store the country name for prefixes, the keys should simple be 31.
Enjoy, Have FUN! H.Merijn
| [reply] [d/l] |
|
|
Poof... your wish is my command. I have no wish to be provocative in my selection of titles, nor in the tone of my posts. (Thank you for the guidance.) I changed the thread-title, although the child-records (existing replies) were not changed thereby.
Nodding politely (but, flinching slightly) at your use of the word, “abhor,” I find it to very often be the case that no “human provided” number serves well as a primary-key. As the subject of a “UNIQUE-indexed column,” perhaps, but not as “the glue that sticks everything together.”
I say this because “unique numbers” in the human world are often not perfectly unique, or if they are, do not stay that way. And when (not if...) one of these eventualities happen, they are very problematic for the computer. An account-number change, for instance, that “ripples” through thousands of records, for instance, and/or that invalidates an archive if the account is of very long standing. An assigned number (in-use by a human) that cannot be stored. Or, as I related above, a number that does not have the one-to-one correspondence with reality that it should (according to the good Dr. Codd) have. In those situations, surrogate (or “synthetic”) keys might excel. You are obliged to handle, and to continue to handle, a real-world circumstance that is “mathematically speaking, compromised.” The business isn’t going to stop for a schema. I dealt with that first-hand, and yes, it shaped my thoughts.
Of course it is a choice to be made; not an absolute maxim. “So help me, Codd.”
| |