in reply to (OT) Database design: omitted partially filled fields.

The number of columns should not be a problem so don't reject option one out of hand. Serializing data, OTOH, is a problem if you intend to do any searching on the serialized column - you force yourself into a situation in which half your search is a SQL query and half is a perl query and you have to serialize and unserialize at every step. So I'd say option #3 is probably not your best bet.

We don't really have enough information to decide between options 1 and 2 - it really depends on the structure of the data. This is the kind of situation where the Entity-Relationship model can be helpful - what are your entities and how do the relate to each other? In the ER model, columns are attributes of entities so ask yourself if the attributes apply to the entities and base your decision on that. For example if the columns are attributes that apply to all rows, but just happen to be missing, then you want one table. But if some attributes only apply to some rows, you want several tables. If your table is vehicles and you only know the mileage for some vehicles, the mileage attribute applies to all rows but is missing for some. But if some of your vehicles are firetrucks while some are sports cars and an attribute is "ladder_length", that attribute only applies to the firetruck rows and you should have separate tables for firetrucks and other vehicles.

The other thing that ER can help you think about is the retlationship between entities (one-to-many, many-to-many, has-a, is-a, etc.). For example, if your entities are people and you have columns for "speaks_English", "speaks_French", and your missing values indicate people that don't speak one of the languages, that is a many-to-many relationship and you should have a separate lookup table with an id for the person and a single column "language_spoken" with values "French", "English" etc. - in this case a single person can have multiple values for the attribute "language_spoken" so the multiple table approach is indicated. Yes, this will mean more complex queries that require joins, but any decent database will optimize for such queries with indexes and the complexity will relate more to the SQL you need to know than to the database itself, which will actually be less complex when normalized into multiple tables.

Your decision should not usually be based on the absolute number of columns or even the absolute number of tables, rather on the structure of the data itself.

  • Comment on Re: (OT) Database design: omitted partially filled fields.