coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:

I'm wracking my brain on this one as my database structure-savvyness isn't quite up to par as of yet, so before I make a sad attempt and waste dozens of hours...,I'm here to ask the wise monks to help me get started.

I'm making a link management system which I know upfront I can do 99% of the work. The part that boggles me is how some scripts allow infinite categories (and infinite categories DEEP!). How is it possible to set up infinite levels of categories for each link?

my $sth = $dbh->prepare( "CREATE TABLE IF NOT EXISTS categories ( id int auto_increment NOT NULL, url VARCHAR(300) NOT NULL, primary key (id) )"); $sth->execute();
I started going this route and using the URL as a common field to relate to another table. That's fine and dandy, I can give a single url a million different categories this way. But I'm not sure how to make categories have categories of their own, especially since the end user will create the levels via a web form during configuration.

Some sample data:

Large Animals - Small Animals - Ocean Life - 4-legged - 4-legged - Under 12" - 2-legged - 2-legged - 12" - 3' - *invertebrate *carnivore *herbavore
As shown above, there are multiple levels of sub categories for the major categories. I am totally confused at where to start. This will be a MySQL database.

My guess is I'd have one table categories for all the first-level categories. Then I could create tables for every single next-level category saving the information of the categories it relates to in each of them. But THAT seems wrong and definitely a waste.

Replies are listed 'Best First'.
Re: database table advice
by aufflick (Deacon) on Mar 27, 2007 at 02:15 UTC
    The database structure for a tree of data is pretty easy - navigating that tree can be tricky. All you need is for each category to have a parent_id field, something like the following (untested) postgres DDL:
    CREATE SEQUENCE category_ids; CREATE TABLE categories ( category_id int NOT NULL default nextval('category_ids'), parent_category_id int ); CREATE SEQUENCE link_ids; CREATE TABLE links ( link_id int NOT NULL default nextval('link_ids'), category_id int REFERENCES categories, url VARCHAR, description VARCHAR );
    Hopefully you can see how that allows for a tree of categories like:
          Animals{1}
        /            \
    Mammals{2}     Invertibrates{3}
      /      \
    Cows{4}  Humans{5}
    If the numbers in curly braces are the category_ids, then, for example, the parent_category_id of Mammals is 1 and that of Humans is 2.

    This maps the tree perfectly, but determining whether, for example, a Human is an Animal but not an Invertibrate is tricky. Oracle has some nice tree extensions to SQL that I have used. I think Postgres does too these days. You can also get an RDBMS neutral implementation by adding a sort column where you store binary sort data. Implementing such a beast is an exercise left for the reader ;)

    Update: what you're looking for in Oracle is "CONNECT BY". SQL 99 has "recursive" query support, but Postgres (and presumably MySQL) don't support it. For an excellent discussion of hierarchical data storage and retrieval in SQL see http://www.yafla.com/papers/sqlhierarchies/sqlhierarchies.htm - for a description of a Postgres implementation see http://openacs.org/forums/message-view?message_id=16799

    Update2: The queries in the link posted from mysql.com that use a single parent id look to me like they would slow down significantly for each nesting level you add - and some of their queries enforce a maximum depth. There's only so many times you can type "LEFT JOIN"! Their idea using left and right bears some investigation, but I would be nervous about application code having to keep that data consistent.

Re: database table advice
by whereiskurt (Friar) on Mar 27, 2007 at 02:02 UTC

    The question I'm hearing is 'How do I store hierarchical data in a relational DB?' (of course I could be missing the point.) If not, take a peek at Managing Hierarchical Data in MySQL for some more information.

    This article applies to more than just MySQL of course. I personally really like the 'nested set' structure for hierarchies that don't change a lot. Enjoy!

    Update: I've just discovered a slightly more verbose example that compares the adjacency list model and the nested set.

    One thing I haven't seen discussed, yet, is the idea of adding another 'dimension' to the tree. There is a 'left' and 'right' for the nested set values, but what about adding a 'z' field? (depth) I wonder if the penalties of updating the nodes could be reduced with something like that. Just thinking out loud.

    Kurt

    PS: What better way to warm up your fingers than hammering out some Perl code! :)

Re: database table advice
by jhourcle (Prior) on Mar 27, 2007 at 04:47 UTC

    I know, it doesn't handle the case of infinately deep categories, but in some cases, using a form of notation may be the better alternative, partcularly if you're using faceted classification in a poly-heirarchy.

    What we do in that case, is we assign each facet a letter, and we make sure that we emit the facets in a consistent order.

    So, using your example:

    A : habitat
      A1 : land animals
      A2 : aquatic animals
    B : animal size
      B1 : small animal
      B2 : large animal
    C : number of legs
      C0 : no legs
      C2 : 2 legs
      C4 : 4 legs
    D : backbone?
      D1 : vertebrate
      D2 : invertebrate
        D2.1 : invertebrate w/ exoskeleton
        D2.2 : invertebrate w/out exoskeleton
    E : diet
      E1 : carnivore
      E2 : herbavore

    So, for instance, a dog might be (A1;B2;C4;D1;E1). Parents to that would be any of : (A1;B2;C4;D1) (A1;B2;C4;E1) (A1;B2;D1;E1) (A1;C4;D1;E1) or (B2;C4;D1;E1). A worm might be (A1;B2;C0;D2.2) (no E, as I don't think diet is significant in classifying them, but i could be wrong) Parents would be : (A1;B2;C0;D2) (A1;B2;D2.2) (A1;C0;D2.2) or (B2;C0;D2.2).

    Now, to search this, we have to use SQL's 'LIKE' command, and we need to make sure that the notation is structured in a way to allow us to unambiguously specify a category. The easiest way is to end each classification within a facet with a dot (to make sure we're not matching 'A11' when we ask for 'A1'), and to prepend each facet in some delimiter (so we don't get facet AA when we want A):

      dog :  ;A1.;B2.;C4.;D1.;E1.
      worm:  ;A1.;B2.;C0.;D2.2.

    If we want all land animals : category  LIKE '%;A1.%'

    If we want all land animals with four legs: category  LIKE '%;A1.%;C4.%'

    ...

    So -- the problems? Really hard to re-arrange after the fact. Benefits? Any depth or number of facets, provided you've allocated enough space for the notation ... of course, the notation can get rather large in complex systems, and it much less efficient to use than the other methods described above.