Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Normalisation Question (OT but trust me, perl implementation)

by Cody Pendant (Prior)
on Aug 04, 2003 at 06:10 UTC ( [id://280571]=perlquestion: print w/replies, xml ) Need Help??

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm just considering a project which would catalogue all the episodes of a TV show using SQL.

I'm trying to think through the normalisation, particularly because a number of people play a number of different roles, for instance say an actor might direct an episode, a person who's normally a writer has a cameo role in one show and so on.

If I were just doing an "actors_play_roles" kind of normalisation, based on the assumption that actor John Smith always plays the character Jake Jones, I could just have tables like this:

ACTORS: id | Name ---------------- 88 | John Smith ROLES: id | Name ---------------- 99 | Jake Jones ACTORS_ROLES: actor_id | role_id ------------------ 88 | 99
which says that Actor 88 played role 99.

But what if he plays Jake Jones in one episode, and then Mycroft Jones, Jake's Evil Twin, in another?

What if he plays both parts in one ep?

Do I just need one big table of "people in general"? And then separate table of directors_of_episodes and actors_in_episodes and so on?

That seems over-complex, though I guess it's still normalised.

Or do I want to somehow do a three-column table like:

STUFF_PEOPLE_DO episode person function 1 | 88 | 23

where 23 is "actor", so we know that in episode 1, person 88 was an actor? And put which part he played somewhere else?

Any suggestions gratefully received.



($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print

Replies are listed 'Best First'.
Re: Normalisation Question (OT but trust me, perl implementation)
by Abigail-II (Bishop) on Aug 04, 2003 at 07:15 UTC
    Since actors and directors are both humans, I wouldn't make a different table for actors and directors. I'd make a table people, with columns for the first name(s), last name, and perhaps a birth (and death) date - the latter of course only if you care for that information. And there would be an 'person_id' column.

    I'd make a second table 'characters', with character names, and a 'character_id' column.

    Then I'd make a third table 'plays', with at least two columns 'person_id' and 'character_id'. One might also add a column 'episode_number'. This gives you the information about which actor played which role(s), and if you add the third column, also in which episodes. One could normalize away that third column, by making the 'plays' have three columns: 'plays_id', 'person_id' and 'character_id', and put 'plays_id' vs 'episode_number' in another table.

    In yet another table, 'directors', I'd put people and episode_numbers together.

    Abigail

Re: Normalisation Question (OT but trust me, perl implementation)
by BrowserUk (Patriarch) on Aug 04, 2003 at 07:13 UTC

    Something like


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
    If I understand your problem, I can solve it! Of course, the same can be said for you.

Re: Normalisation Question (OT but trust me, perl implementation)
by sgifford (Prior) on Aug 04, 2003 at 07:33 UTC

    The point of normalization is to avoid duplicating information, to save storage and so that things can't get out of sync. For example, the simplest implementation would be to store <actor_name,role_name>, but if an actor changed their name (say an actress got married, or you realized you misspelled an actor's name), you might have to update hundreds of records, and if you missed some you'd have inconsistency.

    There's lots of different ways to normalize, with each one making tradeoffs between convenience, speed, and data duplication.

    In your case, I would probably have a PEOPLE table, an EPISODES table, and a ROLES table. I'd create "pseudo-roles" for the various non-acting roles that somebody can play, such as "director" or "writer", perhaps using a special character to indicate that. For example:

    PEOPLE
      id|name
      --+----
       1|Carrie Fisher
       2|John Smith
    
    EPISODES
      year|ep#|name
      ----+---+----
      1999|  1|Attack of the Killer Tomatoes
      1999|  2|Attack of the Killer Clowns
      2002|  1|Attack of the Clones
    
    ROLES
      id|name
      --+----
       1|Jake Jones
       2|Mycroft Jones
       3|@WRITER
       4|@DIRECTOR
    
    DID
      person|role|epyr|ep#
      ------+----+----+---
           2|   1|1999|  1    -- Smith plays Jake in 99-1
           2|   2|2002|  1    -- Smith plays Mycroft in 02-1
           1|   1|2002|  1    -- Fisher plays Jake in 02-1
           1|   3|2002|  1    -- Fisher wrote 02-1
    
    It seems like that avoids duplicating most information, keeps queries simple, and doesn't lose information.

    One flaw is that if the same actor plays the same role in many episodes, the <person,role> pair will be repeated very frequently, and if you make a mistake with which person plays a role, you may have many records to update. You could normalize that out into a PLAYED_ROLE table, then use id's from that in DID, but I think the increased query complexity isn't worth it.

    Updated: Minor formatting and wording changes.

Re: Normalisation Question (OT but trust me, perl implementation)
by dws (Chancellor) on Aug 04, 2003 at 08:15 UTC
    "Actor" and "Producer" are both roles that a Person can take on for a given Episode. An Actor (a Person) might play different Characters in a single Episode, and can even act and Produce.

    That suggests separate tables for

    • Person (person_id, person_details)
    • Role (role_id, role_details)
    • Episode (episode_id, episode_details)
    with "linking tables"
    • PersonProducesEpisode (person_id, episode_id)
    • PersonPlaysRoleInEpisode (person_id, role_id, episode_id)

    The "Actor" role gets collapsed into person, since it's implicit when an Person plays a Role in an Episode.

    This structure give you good flexibility. It even allows for a Role (a Character) being played by multiple Persons in a single Episode, as might happen when a character ages, or if there's a flashback to the character in his/her younger days.

      Assuming there's a 1:1 relationship between an Episode and its producer, the extra table for PersonProducesEpisode is not necessary. You can instead model an Episode as (episode_id, episode_producer, episode_details) where episode_producer is a foreign key to Person.

      Makeshifts last the longest.

        Assuming there's a 1:1 relationship between an Episode and its producer, ...

        You could assume that, but the cost of being wrong is rather high. By allowing for a 1:n relationship, even if it only happens very rarely, you won't have to rebuild the schema when data arrives about a multi-producer episode.

Re: Normalisation Question (OT but trust me, perl implementation)
by PodMaster (Abbot) on Aug 04, 2003 at 08:30 UTC
    gmax has written Database normalization the easier way (which resulted in DBSchema::Normalizer).

    Aside from that, you gotta clarify which Normal Form you're going for. Here's how I see it (aiming for 2NF -- sure Movies aren't Episodes, but the intent is clear enough):

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      That works, but since in this scheme Actor is a subset of Person, you don't need two tables.

      Clarification: That a Person is an Actor is implied by the fact of their portraying a Character (a Role). Hence, you can collapse Actor into Person, and use a single table.

      A Role should not need both an ActorID and a person ID, since an ActorID is keyed by a single personID, and Actor and Person share a 1-to-1 relation. Very nice presentation though.

      ___________
      Eric Hodges
Re: Normalisation Question (OT but trust me, perl implementation)
by Cody Pendant (Prior) on Aug 04, 2003 at 11:15 UTC
    Thanks all of you, you've given me a lot to think about. I really appreciate it.

    Remember all this when I come back and ask you for the query to make the join to grab all this lovely data back out again, right?



    ($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss') =~y~b-v~a-z~s; print
      Well ... since you asked. :P Let's use Abigail-II's suggestion as a model - i took liberty and made some changes though:
      mysql> select * from people;
      +----+-------+-------+-------+-------+
      | id | first | last  | birth | death |
      +----+-------+-------+-------+-------+
      |  1 | John  | Smith | NULL  | NULL  |
      |  4 | Sally | Smith | NULL  | NULL  |
      +----+-------+-------+-------+-------+
      
      mysql> select * from characters;
      +----+----------+---------+
      | id | first    | last    |
      +----+----------+---------+
      |  1 | Jake     | Jones   |
      |  2 | EvilJake | Jones   |
      |  3 | Shelley  | Sellers |
      +----+----------+---------+
      
      mysql> select * from people_characters;
      +--------+------------+
      | people | characters |
      +--------+------------+
      |      1 |          1 |
      |      1 |          2 |
      |      4 |          3 |
      +--------+------------+
      
      So, if we want SQL to fetch all characters played by John Smith, we need to join characters to people via the people_characters table:
      my $sth = $dbh->prepare(' SELECT characters.first, characters.last FROM characters INNER JOIN people_characters ON characters.id = people_characters.characters INNER JOIN people ON people_characters.people = people.id WHERE people.first = ? AND people.last = ? ORDER by characters.last, characters.first '); $sth->execute(qw(John Smith)); while (my $row = $sth->fetchrow_hashref) { print "$row->{first} $row->{last}\n"; }
      Now, if these tables were set up properly with Class::DBI, fetching the characters of a given person should be as easy as:
      use MyDBI::people; my ($person) = MyDBI::people->search(first=>'John', last=>'Smith'); my @characters = $person->characters;
      I know that's a lot to digest (and wait until you need to add and delete people and/or characters!!), but hopefully this will be useful to you when you need it. Good luck. :)

      jeffa

      L-LL-L--L-LL-L--L-LL-L--
      -R--R-RR-R--R-RR-R--R-RR
      B--B--B--B--B--B--B--B--
      H---H---H---H---H---H---
      (the triplet paradiddle with high-hat)
      
        There will always be some bizarre cases that don't fit a simple schema, but this is a good start. I also don't like tables which require two words for their name: I prefer a schema that conforms to one table, one noun.

        Here are some ideas to extend it:

        I would propose the term 'player' for anyone or anything that could act out a performance. Thus s/people/players/. Even inanimate objects and other non-people may be players; think "Cujo" or "Otto Pilot as himself" for easy examples.

        I would propose a table 'episodes' to capture multiple titles that are intended to be seen as continuations or illuminations on a single world. Thus, Star Wars has six canonical episodes, while Star Trek has a much larger number in its canon.

        There are many cases where a single character is portrayed by many different players, either as different versions of the character ("young Indiana Jones") or different components of the character ("voice of Tweaky"). I propose to refer to these, if you want to track them separately, as 'aspects' of a single character. The simplest and most common is when a single player plays 'all' aspects throughout the episode.

        A single player may perform more than one aspect and more than one character, too. Michael J Fox often plays at least two aspects of Marty McFly at once, as well as his daughter, his son, and his multiply-great-grandfather.

        I would propose that a 'role' is the conjunction of 'player' vs 'character' vs 'aspect' vs 'episode'. So s/people_characters/roles/. James Earl Jones was the voice of Darth Vader in Star Wars through episodes SW4, SW5 and SW6. That's three roles by my count.

        For more ideas, browse the IMDb.com site for a while and see if you can reverse engineer their tables.

        --
        [ e d @ h a l l e y . c c ]

Re: Normalisation Question (OT but trust me, perl implementation)
by Bagration (Initiate) on Aug 04, 2003 at 20:17 UTC
    I had a similar Problem. I have a collection of Video tapes: Movies, Star Trek episodes and so on:

    There are, relevant for your problem, the tables:

    PERSONS, PERSON_TITLE, TITLES, CONTRIBUTORS

    Sometimes you have persons as actors in one episode and as director in another.

    Examples:

    PERSONS
    +------+------------+---------------+
    | prNR | prLASTNAME | prNAME |
    +------+------------+---------------+
    | 76 | McNeill | Robert Duncan |
    +------+------------+---------------+

    PERSON_TITLE

    +-------+------+------+------+
    | PT_NR | prNR | ttNR | cnNR |
    +-------+------+------+------+
    | 542 | 76 | 242 | 2 |
    +-------+------+------+------+

    cnNR stands for the number in CONTRIBUTORS.

    This table looks as follows:
    +------+-----------------+
    | cnNR | cnFUNCTIONALITY |
    +------+-----------------+
    | 1 | Actor |
    | 2 | Director |

    Table Titles

    +------+----------------------+------+
    | ttNR | ttTITLE | dtNR |
    +------+----------------------+------+
    | 242 | Voyager - Das Ritual | 30 |
    +------+----------------------+------+
    (dtNR gives just the number of the Datacarrier and doesn't
    bother here).

    In this episode Duncan McNeill is actor and direktor as well.

    So you can guess how the PERSON_TITLE set might change:

    +-------+------+------+------+
    | PT_NR | prNR | ttNR | cnNR |
    +-------+------+------+------+
    | 527 | 76 | 242 | 1 |
    +-------+------+------+------+

    Your solution: you only need a further Field in your actors_role table.

    I hope this was a bit helpful. Sorry, having watched Star Trek in Germany, I have only the German episode names.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://280571]
Approved by Zaxo
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (2)
As of 2024-04-26 00:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found