BUU has asked for the wisdom of the Perl Monks concerning the following question:

I have some data I'm going to be parsing to store in a database. This data has a dozen to two dozen fields. The problem is that not every "record" has all of these fields: one record might have 12 fields, the next only 5, and the next 24. So my problem is: how should I design a table to accommodate this?


Option 1: The first thought is to keep it simple: create a table with the thirty-odd columns that can actually appear in the data, and leave a field NULL if the data doesn't provide it.

Having 30+ columns in an RDBMS just seemed like bad design to me, though, so I'm unsure if this is a good idea. I can't really quote anything concrete either way, but it just "feels" wrong.
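For concreteness, option 1 might look something like this (the table and column names are illustrative, not from the actual data):

```sql
-- Option 1: one wide table; columns a record doesn't supply are left NULL
CREATE TABLE records (
    id       INT PRIMARY KEY,
    name     VARCHAR(100) NOT NULL,  -- fields every record has
    field_a  VARCHAR(100),           -- optional fields, NULL when absent
    field_b  INT,
    field_c  DATE
    -- ... up to 30-odd nullable columns
);
```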


Option 2: The second thought was to have two tables. The first table would have all of the data that every record possesses, and the second table would basically have 3 columns: belongs-to, type, value, so I could record all of the extra data there.

I'm sure I've seen people recommend against this strategy, however, and it also feels kind of nasty.
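For concreteness, option 2 (often called an entity-attribute-value, or EAV, layout) might be sketched like this (names are illustrative):

```sql
-- Option 2: a core table plus a generic "extra fields" table
CREATE TABLE records (
    id   INT PRIMARY KEY,
    name VARCHAR(100) NOT NULL          -- fields every record has
);

CREATE TABLE record_extras (
    record_id INT REFERENCES records(id),  -- "belongs-to"
    type      VARCHAR(50),                 -- field name
    value     VARCHAR(255)                 -- field value, stored as text
);

-- fetching one record together with its extras requires a join:
SELECT r.name, e.type, e.value
  FROM records r
  LEFT JOIN record_extras e ON e.record_id = r.id
 WHERE r.id = 42;
```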


Option 3: Again, make a table that just has columns for the data every record has, but add an extra column to store a serialized Perl data structure containing the extra data for a particular record.

Again, this seems like a bad idea, because then the database can't natively access the extra data, with all the problems that entails.


So none of the solutions "feel" very good, which is why I come here. Is one of these solutions better than the others, good enough to use? Or is there a better way to solve this problem?

Re: (OT) Database design: omitted partially filled fields.
by mpeppler (Vicar) on Jul 21, 2004 at 15:07 UTC
    30 columns in a table shouldn't be a problem, as long as there is no real duplication of data (meaning that there is no way to normalize the data by splitting it into 2 or more tables).

    Your option 2 gives you a lot of flexibility, but makes retrieving the data that relates to a single source record rather painful (multi-table joins). If your application is going to fetch this data a lot then I wouldn't use that schema.

    Option 3 works well, as long as you don't need to refer to the serialized data in any SQL statement (in a WHERE clause, for example). An alternative to serializing the data is to store an XML block with the additional "columns"; this is becoming fairly common, with DBMS engines that can generate pseudo-columns from the XML with an internal XML parser.

    All in all my hunch is to go with the most straightforward solution (your Option 1), though I obviously don't know all the details of the app to gauge whether this might be too limiting.

    Michael

Re: (OT) Database design: omitted partially filled fields.
by xorl (Deacon) on Jul 21, 2004 at 14:59 UTC
    This is more of a database question than a perl question, but I can see how people could find the answers helpful.

    I'd start by learning about Database Normalization: http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html

    In this specific case option 2 sounds like it might be the closest to a normal form.

    Of course option 1 will work, it just probably ain't the greatest.

    I wouldn't do Option 3. While Perl will have no trouble parsing this field, other applications might. It also limits the usefulness of the database.

Re: (OT) Database design: omitted partially filled fields.
by adrianh (Chancellor) on Jul 21, 2004 at 15:16 UTC

    One of my favourite answers - it depends :-)

    If your 30-odd fields are all the same sort of thing (e.g. software projects with 1-30 members each) then I'd split them off into separate tables. One for programmers, one for projects, and one to map programmers onto projects.
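    A sketch of that split, under the hypothetical programmers/projects example (all names illustrative):

    ```sql
    -- Two entity tables plus a mapping table for the many-to-many link
    CREATE TABLE programmers (
        id   INT PRIMARY KEY,
        name VARCHAR(100)
    );

    CREATE TABLE projects (
        id    INT PRIMARY KEY,
        title VARCHAR(100)
    );

    -- One row per (programmer, project) membership
    CREATE TABLE project_members (
        programmer_id INT REFERENCES programmers(id),
        project_id    INT REFERENCES projects(id),
        PRIMARY KEY (programmer_id, project_id)
    );
    ```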

    If you've really got 30 separate unrelated fields then just stick them all in the same table. You might have to split it later for efficiency reasons - but you probably won't so don't prematurely optimise ;-)

    That said, I'd be surprised if all 30 fields are unrelated. Is there another concept in there that you can abstract out to another table (e.g. four fields that should really be an address table)?

Re: (OT) Database design: omitted partially filled fields.
by jZed (Prior) on Jul 21, 2004 at 15:55 UTC
    The number of columns should not be a problem, so don't reject option 1 out of hand. Serializing data, OTOH, is a problem if you intend to do any searching on the serialized column - you force yourself into a situation in which half your search is an SQL query and half is a Perl query, and you have to serialize and deserialize at every step. So I'd say option 3 is probably not your best bet.

    We don't really have enough information to decide between options 1 and 2 - it really depends on the structure of the data. This is the kind of situation where the Entity-Relationship model can be helpful - what are your entities, and how do they relate to each other? In the ER model, columns are attributes of entities, so ask yourself whether the attributes apply to the entities and base your decision on that. If the columns are attributes that apply to all rows, but just happen to be missing, then you want one table. But if some attributes only apply to some rows, you want several tables. If your table is vehicles and you only know the mileage for some vehicles, the mileage attribute applies to all rows but is missing for some. But if some of your vehicles are firetrucks while some are sports cars and an attribute is "ladder_length", that attribute only applies to the firetruck rows, and you should have separate tables for firetrucks and other vehicles.

    The other thing that ER can help you think about is the relationship between entities (one-to-many, many-to-many, has-a, is-a, etc.). For example, if your entities are people and you have columns for "speaks_English" and "speaks_French", and your missing values indicate people who don't speak one of the languages, that is a many-to-many relationship, and you should have a separate lookup table with an id for the person and a single column "language_spoken" with values "French", "English", etc. - in this case a single person can have multiple values for the attribute "language_spoken", so the multiple-table approach is indicated. Yes, this will mean more complex queries that require joins, but any decent database will optimize such queries with indexes, and the complexity will relate more to the SQL you need to know than to the database itself, which will actually be less complex when normalized into multiple tables.
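    The languages example might be sketched like this (names illustrative):

    ```sql
    -- Many-to-many: one row per (person, language) pair instead of
    -- one speaks_X column per language
    CREATE TABLE people (
        id   INT PRIMARY KEY,
        name VARCHAR(100)
    );

    CREATE TABLE languages_spoken (
        person_id       INT REFERENCES people(id),
        language_spoken VARCHAR(50),
        PRIMARY KEY (person_id, language_spoken)
    );

    -- everyone who speaks French:
    SELECT p.name
      FROM people p
      JOIN languages_spoken l ON l.person_id = p.id
     WHERE l.language_spoken = 'French';
    ```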

    Your decision should not usually be based on the absolute number of columns or even the absolute number of tables, rather on the structure of the data itself.

Re: (OT) Database design: omitted partially filled fields.
by herveus (Prior) on Jul 21, 2004 at 15:04 UTC
    Howdy!

    It's hard to tell. Sometimes your first option is the correct one.

    Do you have a normalized schema for the data? Are there repeated fields? Are they all distinct (if sometimes/often null)?

    If there really are thirty attributes, you may well need thirty columns. Don't reject it because it "feels wrong", although your concern is not misplaced.

    Option 2 might be workable, but that depends on how the data is set up and used.

    yours,
    Michael
Re: (OT) Database design: omitted partially filled fields.
by dragonchild (Archbishop) on Jul 21, 2004 at 15:49 UTC
    Why are you trying to stuff all the different types of records into one table?

    I suspect you have a situation similar to the following:

    1. I have pharmaceutical products. They have a set of attributes, plus a name and an ID.
    2. I have services. They have a different set of attributes, plus a name and an ID.
    3. I have lineitems. They need to FK to a given product, but one lineitem might be for a pharmaceutical and another for a service.

    I created three tables. One is a general product table. It contains the name, ID, and any other common attributes. The others are detail tables. Every single row in the product table corresponds to a given detail table.

    Now, you're looking at this and saying "What if I want every detail about every single product? That's a ton of outer joins!". And, you'd be right. But, my experience is that I already know if I want pharmaceuticals or services. I almost never want to pull both at the same time. If I do, it's usually just pulling the name (which is on the product table). If I want to drill down, I can do so.
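    A hypothetical version of that three-table layout (all names and attributes illustrative, not from an actual schema):

    ```sql
    -- General table: common attributes; every row also has exactly one
    -- row in one of the detail tables
    CREATE TABLE products (
        id   INT PRIMARY KEY,
        name VARCHAR(100)
    );

    CREATE TABLE pharmaceutical_details (
        product_id INT PRIMARY KEY REFERENCES products(id),
        dosage     VARCHAR(50)          -- pharmaceutical-only attributes
    );

    CREATE TABLE service_details (
        product_id INT PRIMARY KEY REFERENCES products(id),
        duration   INT                  -- service-only attributes
    );

    -- lineitems FK to products.id and work for either kind of product
    CREATE TABLE lineitems (
        id         INT PRIMARY KEY,
        product_id INT REFERENCES products(id),
        quantity   INT
    );
    ```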


Re: (OT) Database design: omitted partially filled fields.
by knoebi (Friar) on Jul 21, 2004 at 14:57 UTC
    I would go for Option 1! KISS. I don't see any problem having 30 cols in a RDBMS.

    Option 1
    + fast: one insert or update to write a record, one select to read it
    + easy to understand for everybody

    Option 2 + 3

    - slow
    - not intuitive
    If you have some required fields and 0..many optional fields per record, you could also write all the optional fields serialized into one column, but then you can't easily access the optional fields with SQL.

    ciao, knoebi

      You seem to have overlooked the real problems with option 1 and the good points of options 2 and 3.

      1. If you already have 1e6 rows and decide to introduce a new column, that's 1e6 NULL entries you have to create, which takes up space depending on the implementation.

      If your table is wide, it may be a sign of poor organization. BUU suggested 2 tables, 1 for required fields, 1 for optional ones. It could be grouped even further: 1 for primary info, 1 for, say, address information (if that is some of the data), and so forth...

      2. The slowness is probably negligible, depending on what's being done. If you want to update all the data, yes, it's slow. If you want to select, it may not be bad. It all depends on the frequency of operations.

      And tell me, what database IS intuitive? W/o an ERD, looking at a db is a pain unless someone explains things.


Re: (OT) Database design: omitted partially filled fields.
by Joost (Canon) on Jul 21, 2004 at 15:19 UTC
    Some databases (for example, PostgreSQL) support table inheritance, so you can make a base table containing all the columns that are always filled, and have other tables inherit from it and add their own columns. This can be useful if you can split up your rows into separate "classes" with an inheritance relationship.
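    In PostgreSQL that looks roughly like this (table and column names are illustrative):

    ```sql
    -- Base table: columns every record has
    CREATE TABLE records (
        id   INT,
        name VARCHAR(100)
    );

    -- Child table: inherits id and name, adds its own columns
    CREATE TABLE detailed_records (
        extra_field VARCHAR(100)
    ) INHERITS (records);

    -- a SELECT on the base table also returns rows from child tables:
    SELECT id, name FROM records;
    ```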

    Personally, I'd go for your option 1, though. I've used some of the other schemes (and table inheritance too), but I found them slow and complicated, and they only give you real benefits in very specific cases.

    updated: spelling

Re: (OT) Database design: omitted partially filled fields.
by TrekNoid (Pilgrim) on Jul 21, 2004 at 15:34 UTC
    I have some data I'm going to be parsing to store in a database. This data has a dozen to two dozen fields. The problem is that not every "record" has all of these fields, so one record might have 12 fields, and the next one only has 5 and the next has 24. So my problem is, how should I design a table to accommodate this?

    The answer is 'depends'.

    1.) Depends upon how big your data set is going to get

    2.) Depends on how the data is going to be retrieved (through occasional big queries, or short small ones).

    3.) Depends on the type of data the fields are (null Integers don't take up a lot of room, for example, so no reason to worry about them being empty).

    I've gone different ways, depending on the above

    In general, though, if your 30+ columns are a normalized set of data, and the 'sometimes' values aren't that long, then you really can't go wrong with KISS. It's the best way to ensure long-term flexibility of use.

    If you know that you're rarely going to do long batch queries, and this is more for fast retrieval of single records, then option 2 isn't a bad choice. Option 3 even works, as long as your '5 or so fields' are the ones you match on most often.

    A variation on Option 3 that I've used *once* in my life, for a very specific design, is to create two tables, with a one-to-one required relationship. I put the smaller, searchable subset in table one, and the rest (mostly message headers, various textual fields, etc...) into the other table.

    That served two purposes:

    1.) The main table was small, compact, and highly indexed, so searching, exporting, debugging, etc, was a snap.

    2.) The second table, after some period of time, was determined not to be necessary except for two fields, so it was easy to move those two fields to the main table and just drop the second one.
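    The one-to-one split described above might be sketched like this (the message-ish column names are made up for illustration):

    ```sql
    -- Compact, highly indexed core table: the searchable subset
    CREATE TABLE messages (
        id      INT PRIMARY KEY,
        sender  VARCHAR(100),
        sent_on DATE
    );

    -- Bulky, rarely searched rest; PRIMARY KEY on the FK enforces
    -- at most one detail row per core row (one-to-one)
    CREATE TABLE message_details (
        message_id INT PRIMARY KEY REFERENCES messages(id),
        headers    TEXT,
        body       TEXT
    );
    ```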

    Hope some of that's useful

    Trek