bangor has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
Apologies for being off-topic but I come here seeking advice as I know the depth of experience in the monastery (and everything else in the project is built with Perl).

I am creating a database for a heritage organisation who try to keep alive 100s of varieties of heirloom apple trees - the ones we will need when the climate goes really crazy!

The tree table has a number of fields like:
usage origin flowering ripe picking status rootstock is_triploid
Some of these fields can have multiple values, so to store them in an index I intend to create extra tables. I have come up with two options but am unsure which to choose.

Option 1
Create a table for each field:

CREATE TABLE tree_usage_index ( id int, name varchar(100), PRIMARY KEY (name,id) ); CREATE TABLE tree_flowering_index ( id int, name varchar(100), PRIMARY KEY (name,id) );
The problem I have with this set up is that to do a search for
trees with usage of 'cider' or usage of 'juicer' and with flowering of 'spring' and with ripe of 'autumn'
I will have to query multiple tables, and the code I have written for this seems a bit complex.
SELECT id from tree_usage_index WHERE name='cider' OR name='juicer' # if there are results SELECT id from tree_flowering_index WHERE id IN (results) and name='sp +ring' # if there are still results SELECT id from tree_ripe_index WHERE id IN (results) and name='autumn'
I know you can nest SQL selects but as there are 12 fields that could be involved in a query the SQL could get a bit hairy. Also, I will probably need to add an index to the 'id' field of each table to speed up the WHERE id IN (results)

Option 2
Create one table for all fields:

CREATE TABLE tree_field_index ( field varchar(100), id int, name varchar(100), PRIMARY KEY (field,name,id), );
The advantage I see with this is that I will only need to query the one table to perform the same search:
SELECT id from tree_field_index WHERE ( (field='usage' AND name='cider +') OR (field='usage' AND name='juicer') ) AND (field='flowering' AND +name='spring') AND (field='ripe' AND name='autumn')
Any advice on the right path to take, or some other solution, greatly appreciated.

Replies are listed 'Best First'.
Re: OT - SQL choosing a schema for index tables
by QuillMeantTen (Friar) on Aug 28, 2015 at 07:49 UTC

    Update, it seems I cant link the application like that so here is the url :
    application url

    Beware reader, it's in french. If you're interested I wrote it first in OOBasic as a macro for openoffice base, then rewrote it in java and wrote the web frontend in javascript using meteorjs

    some time ago I had to build an that would generate shifts calendars for the interns at my town's hospital I had to take into account quite a bit of factors and ended up having to make many different kind of searches to check and cross reference columns values.

    I say : plan for growth, have a look at this page to get a good start, if performance is not the main issue (say you wont have a metric ton of traffic and requests over small time) this should allow you to build with a good design from the ground up

    from my database course at uni I'd say that you would want to stop thinking in terms of fields and start thinking about what real things you want to document.
    As you build your ER model you will quickly identify what fields you need
    Then the next step is to get rid of what is legal in the ER model but would slow you down and make growth/maintenance hard.

    This step is (in my mind at least) very important because even if you dont have to regularly add new features if you do it well maintenance will be a breeze (at least compared to the alternative state of things).

    Worth to have a look at :

    1. only use atomic attributes but your ER model should take care of that
    2. avoid the use of unnecessary attribute put them somewhere else (again, ER model should help)
    3. Keys are good, dont have attributes in a table determined by anything else than a key, you'll thank yourself when writing new queries down the road
    4. keys should be decided with the domain in mind,not the implementation at least that's how I saw it when I learnt about them
    5. Multi valued dependencies was my least favorite, for this one I'll only say sit down with a pen and some paper, if you have done the previous work the best way you could their should not be too much trouble.
      try to find examples onlines this is a tricky one
    6. Boyce-Codd is easier than the previous one but then again I advise the study of multi valued dependencies, functional dependencies and such.

    This list is obviously incomplete, there are also the fifth and sixth normal forms but I have not studied them (yet) also take everything I told you with a pinch of salt since I'm only a student and while interested by dbs I'm not fascinated by them.

    last note : I tried to write a dbms in awk but gave up this coder did not, its a fun read ^^
    Cheers and as usual please correct any mistake you see I'll happily strike them out and update my personal wetware database

      Thank you for your advice QuillMeantTen, and for all the links - I see have plenty of reading to do.
Re: OT - SQL choosing a schema for index tables
by erix (Prior) on Aug 28, 2015 at 08:57 UTC

    Just some loose remarks:

    - What dbms will you be using? SQL implementations aren't so uniform that it does not matter (and in such a small db interoperability is probably unimportant?).

    - It seems unwise to name a table with 'index' or an '_index' suffix. An index belongs to a table but it is something separate (it isn't even part of the SQL specification; indexes are just an implementation detail that may or may not speed up a retrieval from a table).

    - The data you describe are so small (hundreds of rows) that performance and indexes are unlikely to be important considerations (small tables are always seqscanned). Concentrate on getting easy-to-understand queries.

    - Your option 2 (everything in one table) is called EAV and is generally frowned upon (see for instance this). I think eav is only defensible when a 'table structure' needs to be extremely flexible (and the structure is unknown beforehand), and this is not the case for you).

    Maybe you should try to show us the SQL-'hairiness' that you fear. It probably isn't as bad as you think :)

      Thank you erix for your helpful reply. I am using MySQL.

      - It seems unwise to name a table with 'index' or an '_index' suffix.
      Noted - I will change

      - ...Concentrate on getting easy-to-understand queries.
      This is exactly what I am trying to do as I know some one else will be maintaining.

      - Your option 2 (everything in one table) is called EAV and is generally frowned upon.
      Thanks for the links, and yes Option 2 seems to be a bad idea.

      Maybe you should try to show us the SQL-'hairiness' that you fear. It probably isn't as bad as you think :)

      I don't have any hairiness at the moment, what I have is a Perl loop which queries each table in turn:

      my ($self,$params) = @_; my %tables = ( origin => 'tree_origin', category => 'tree_flowering', ripe => 'tree_ripe', usage => 'tree_usage', ); my $ids; for my $name (keys %tables) { my $value = $params->{$name} or next; $ids = $self->search_index_table( $tables{$name}, $value, $ids ); last unless $ids; }
      And the sub to query the database is something like this...
      sub search_index_table { my ( $self, $table, $value, $ids ) = @_; my $query = "SELECT id FROM $table WHERE name=?"; if ($ids) { my $id_string = join(',', @$ids ); $query .= " AND id IN ($id_string)"; } my @results; my $dbh = $self->{'dbh'}; my $sth = $dbh->prepare($query); $sth->execute($value); while ( my ($id) = $sth->fetchrow_array() ) { push(@results,$id); } return (@results ? \@results : undef); }
      This all works correctly, but I can't help thinking that all these calls to the database server (which is on another machine) is very inefficient. I suppose my real itch is that there are only a couple of hundred rows in the main table, and I have another 10 or so tables to index the multiple-valued fields - from my reading this seems the correct way to do things, but at the same time feels over-engineered for such a small dataset.

        It's a bit unusual but if it works well and performs well, I wouldn't worry about it (because the data is so small).

        But it would indeed (as already mentioned upthread) make more sense to have an SQL statement that JOINs the tables and that retrieves the rows that you want in one trip to the databaseserver. (But it's quite possible that you'd hardly notice a speed difference.)

      (everything in one table) is called EAV and is generally frowned upon

      It is only frowned upon when there is another way. Such as, when all attributes are static. If there's only one value per attribute, it would go in the main table, if there are many, they would each get a child table. It is not the structure that needs to be flexible (that would likely be a bad use of an EAV) but that the attributes are unknown. Such as a translation table, where the translated items are unknown, as are the languages it will be translated into.

      I got he impression that the attributes themselves are not static, which is why i suggested an EAV. If they are static, however, it would be frowned upon by the purists in this case.

Re: OT - SQL choosing a schema for index tables
by parthodas (Acolyte) on Aug 28, 2015 at 07:16 UTC
    How many fields would be there for multiple values. IF there is only one such column, then you can think of creating a different table and use Foreign Key. If the no of columns are higher, then the situation will become too complex to get a data from different tables.
    Also, you can insert multiple values in a column. Though it is not advisable. So it depends, depending on your situation & requirement how you plan to retrieve/fetch data.
      There are 10 fields that can have multiple values. I did look into storing these in the main table but everywhere I saw "don't do that". You can see how I am currently fetching data in my reply to erix below.
Re: OT - SQL choosing a schema for index tables
by BrowserUk (Patriarch) on Aug 28, 2015 at 14:40 UTC

    This seems like an ideal use for BIT() type fields; assuming none of your multi-value fields can have more than 64 possibilities.

    Then in your program, you define constants for each of the bits in each of the fields:

    use enum qw[ BITMASK:Usage_ cider juicer cooking eating display_and_th +row_away ]; use enum qw[ BITMASK:Flowers_ spring summer autumn winter ]; use enum qw[ BITMASK:Ripe_ spring summer autumn winter ]; use constant Any => -1; ... my $sth = prepare( 'select id from trees where usage = ? and flowers = + ? and ripe = ?;' ); $sth->execute( Usage_cider|Usage_juicer, Flowers_spring|Flowers_summer +, Any );

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

      Of course it is possible --- but now the program is needed to explain those bit-values that are in the database. I count that as a disadvantage.

      Does not sound like a good idea to me. It would be more acceptable when the mapping could be done inside the database (thus documenting these bit-fields). But even then this does look like over-engineering to me. (There are 200 rows, the OP has said).

        It would be more acceptable when the mapping could be done inside the database

        It's pretty trivial to set up secondary table(s) that maps bits to strings. Done right, they could be joined to produce human readable output.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
        In the absence of evidence, opinion is indistinguishable from prejudice.
        I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
Re: OT - SQL choosing a schema for index tables
by chacham (Prior) on Aug 28, 2015 at 14:50 UTC

    This is similar to an entity with unknown attributes, to which the usual solution is an Entity-Attribute-Value table, similar to your second option.

    EAVs are feared for their slowness. However, in many cases, it is simply not true, as they perform quite well. Further, in fancier RDBMSs, the table can be partitioned, which works well.

    Ids are used when there is no natural primary key. (So many people use surrogate keys out of sheer laziness, ignorance, or stupidity.) Though, assuming the only unique identification of the tree is that it is noted in the database, id is fine. This would leave 2 tables. One for the tree, the other the EAV. Also, assuming the entire database or schema is tree related, "tree" becomes redundant, unless related to the main table:

    CREATE TABLE Tree ( Id INT PRIMARY KEY, Owner Nickname Location .... ); CREATE TABLE Tree_Attribute ( Tree REFERENCES Tree, Attribute VARCHAR(020), Value VARCHAR(100), PRIMARY KEY(Tree, Attribute, Value) );

    Depending on the RDBMS, you may not be able to use the word value. Also, you might be required to define Tree in the second table. If allowed, however, this allows for better coding. Of course, you can name the keys, to make error messages clearer, though that is a matter of preference.

    If attributes are not arbitrary, will contain a description, or will be chosen from a drop down or the like, you can add a lookup table for attribute as well, and FK the EAV to it accordingly.

    Update: On second thought, i was reading this as if the attributes were unknown. If they are known, a separate table per attribute would be advisable.

Re: OT - SQL choosing a schema for index tables
by 1nickt (Canon) on Aug 27, 2015 at 22:06 UTC

    Correct, this is way OT. You should do some reading elsewhere on database design; your question is fundamental. (Look into JOIN, you rarely need nested SELECTs for the type of work you described.)

    The way forward always starts with a minimal test.
Re: OT - SQL choosing a schema for index tables
by Anonymous Monk on Aug 28, 2015 at 01:38 UTC

    Hi,

    It's been a while since I did anything serious with a DB, but couldn't you just create an Index on your main table, rather than creating other tables?

    J.C.

      Yes but the problem is that these fields have multiple values, and I don't think I can create a usable index on a column that has more than one value.
        SELECT id from tree_field_index WHERE ( (field='usage' AND name='cider') OR (field='usage' AND name='juicer') ) AND (field='flowering' AND name='spring') AND (field='ripe' AND name='autumn')

        This won't work. The select statement gets applied to each record in turn.You can't have a single record where field='usage' AND field='flowering' so the result will be nothing.

        Which database are you using, and how many records roughly are you dealing with ?

        poj
Re: OT - SQL choosing a schema for index tables
by locked_user sundialsvc4 (Abbot) on Aug 28, 2015 at 14:32 UTC

    I see immediate problems with table-names like tree-flowering.   Here would be my rules of thumb for designing such a database.

    (1)   Create a table for each “thing.”   Well, in this case, the only “thing” is a tree.   Any tree, “flowering” or not.   Each tree record should have a primary key.   Let’s call it tree_id.

    (2)   Create other table(s) for “attributes.”   Let’s say that each of the fields you listed (usage, origin, flowering, etc.) are all multi-valued attributes, each of which can have only one of a set of possible values chosen from a list.   In this case, each one would be in its own field as you have now done.

    (3)   Create tables to contain lookup-lists for each attribute.   A table that contains, for example, all possible values for flowering.

    (4)   Use referential integrity rules and foreign keys to ensure consistency.   For instance, a rule that specifies that flowering must contain a value from the lookup table, and that, should the value of the lookup be changed, the change will “cascade” to all occurrences of the former value.   Also, that attribute values are unique and are NOT NULL.

    (5)   Use efficiencies like IN('spring', 'fall') to simplify queries.   ‘Nuff said.

    The actual SQL schema is left as an exercise for the reader.