Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:
I'm just considering a project which would catalogue all the episodes of a TV show using SQL.
I'm trying to think through the normalisation, particularly because a number of people play a number of different roles, for instance say an actor might direct an episode, a person who's normally a writer has a cameo role in one show and so on.
If I were just doing an "actors_play_roles" kind of normalisation, based on the assumption that actor John Smith always plays the character Jake Jones, I could just have tables like this:
ACTORS:
id | Name
----------------
88 | John Smith
ROLES:
id | Name
----------------
99 | Jake Jones
ACTORS_ROLES:
actor_id | role_id
------------------
88 | 99
which says that Actor 88 played role 99.
But what if he plays Jake Jones in one episode, and then Mycroft Jones, Jake's Evil Twin, in another?
What if he plays both parts in one ep?
Do I just need one big table of "people in general"? And then separate table of directors_of_episodes and actors_in_episodes and so on?
That seems over-complex, though I guess it's still normalised.
Or do I want to somehow do a three-column table like:
STUFF_PEOPLE_DO
episode person function
1 | 88 | 23
where 23 is "actor", so we know that in episode 1, person 88 was an actor? And put which part he played somewhere else?
Any suggestions gratefully received.
($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
=~y~b-v~a-z~s; print
Re: Normalisation Question (OT but trust me, perl implementation)
by Abigail-II (Bishop) on Aug 04, 2003 at 07:15 UTC
|
Since actors and directors are both humans, I wouldn't make
a different table for actors and directors. I'd make a table
people, with columns for the first name(s), last name, and
perhaps a birth (and death) date - the latter of course only
if you care for that information. And there would be an
'person_id' column.
I'd make a second table 'characters', with character names,
and a 'character_id' column.
Then I'd make a third table 'plays', with at least two
columns 'person_id' and 'character_id'. One might also add
a column 'episode_number'. This gives you the information
about which actor played which role(s), and if you add the
third column, also in which episodes. One could normalize
away that third column, by making the 'plays' have three
columns: 'plays_id', 'person_id' and 'character_id', and
put 'plays_id' vs 'episode_number' in another table.
In yet another table, 'directors', I'd put people and
episode_numbers together.
Abigail | [reply] |
Re: Normalisation Question (OT but trust me, perl implementation)
by BrowserUk (Patriarch) on Aug 04, 2003 at 07:13 UTC
|
| [reply] [d/l] |
Re: Normalisation Question (OT but trust me, perl implementation)
by sgifford (Prior) on Aug 04, 2003 at 07:33 UTC
|
The point of normalization is to avoid duplicating information, to save storage and so that things can't get out of sync. For example, the simplest implementation would be to store <actor_name,role_name>, but if an actor changed their name (say an actress got married, or you realized you misspelled an actor's name), you might have to update hundreds of records, and if you missed some you'd have inconsistency.
There's lots of different ways to normalize, with each one making tradeoffs between convenience, speed, and data duplication.
In your case, I would probably have a PEOPLE table, an EPISODES table, and a ROLES table. I'd create "pseudo-roles" for the various non-acting roles that somebody can play, such as "director" or "writer", perhaps using a special character to indicate that. For example:
PEOPLE
id|name
--+----
1|Carrie Fisher
2|John Smith
EPISODES
year|ep#|name
----+---+----
1999| 1|Attack of the Killer Tomatoes
1999| 2|Attack of the Killer Clowns
2002| 1|Attack of the Clones
ROLES
id|name
--+----
1|Jake Jones
2|Mycroft Jones
3|@WRITER
4|@DIRECTOR
DID
person|role|epyr|ep#
------+----+----+---
2| 1|1999| 1 -- Smith plays Jake in 99-1
2| 2|2002| 1 -- Smith plays Mycroft in 02-1
1| 1|2002| 1 -- Fisher plays Jake in 02-1
1| 3|2002| 1 -- Fisher wrote 02-1
It seems like that avoids duplicating most information, keeps queries simple, and doesn't lose information.
One flaw is that if the same actor plays the same role in many episodes, the <person,role> pair will be repeated very frequently, and if you make a mistake with which person plays a role, you may have many records to update. You could normalize that out into a PLAYED_ROLE table, then use id's from that in DID, but I think the increased query complexity isn't worth it.
Updated: Minor formatting and wording changes.
| [reply] |
Re: Normalisation Question (OT but trust me, perl implementation)
by dws (Chancellor) on Aug 04, 2003 at 08:15 UTC
|
"Actor" and "Producer" are both roles that a Person can take on for a given Episode. An Actor (a Person) might play different Characters in a single Episode, and can even act and Produce.
That suggests separate tables for
- Person (person_id, person_details)
- Role (role_id, role_details)
- Episode (episode_id, episode_details)
with "linking tables"
- PersonProducesEpisode (person_id, episode_id)
- PersonPlaysRoleInEpisode (person_id, role_id, episode_id)
The "Actor" role gets collapsed into person, since it's implicit when an Person plays a Role in an Episode.
This structure give you good flexibility. It even allows for a Role (a Character) being played by multiple Persons in a single Episode, as might happen when a character ages, or if there's a flashback to the character in his/her younger days.
| [reply] |
|
| [reply] |
|
Assuming there's a 1:1 relationship between an Episode and its producer, ...
You could assume that, but the cost of being wrong is rather high. By allowing for a 1:n relationship, even if it only happens very rarely, you won't have to rebuild the schema when data arrives about a multi-producer episode.
| [reply] |
Re: Normalisation Question (OT but trust me, perl implementation)
by PodMaster (Abbot) on Aug 04, 2003 at 08:30 UTC
|
gmax has written Database normalization the easier way (which resulted in DBSchema::Normalizer).
Aside from that, you gotta clarify which Normal Form you're going for.
Here's how I see it (aiming for 2NF -- sure Movies aren't Episodes, but the intent is clear enough):
MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!" | I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README). | ** The third rule of perl club is a statement of fact: pod is sexy. |
| [reply] [d/l] |
|
| [reply] |
|
| [reply] |
Re: Normalisation Question (OT but trust me, perl implementation)
by Cody Pendant (Prior) on Aug 04, 2003 at 11:15 UTC
|
Thanks all of you, you've given me a lot to think about. I really appreciate it.
Remember all this when I come back and ask you for the query to make the join to grab all this lovely data back out again, right?
($_='kkvvttuubbooppuuiiffssqqffssmmiibbddllffss')
=~y~b-v~a-z~s; print
| [reply] [d/l] |
|
my $sth = $dbh->prepare('
SELECT characters.first, characters.last
FROM characters
INNER JOIN people_characters
ON characters.id = people_characters.characters
INNER JOIN people
ON people_characters.people = people.id
WHERE people.first = ?
AND people.last = ?
ORDER by characters.last, characters.first
');
$sth->execute(qw(John Smith));
while (my $row = $sth->fetchrow_hashref) {
print "$row->{first} $row->{last}\n";
}
Now, if these tables were set up properly with
Class::DBI, fetching the characters of a given
person should be as easy as:
use MyDBI::people;
my ($person) = MyDBI::people->search(first=>'John', last=>'Smith');
my @characters = $person->characters;
I know that's a lot to digest (and wait until you need to
add and delete people and/or characters!!), but hopefully
this will be useful to you when you need it. Good luck. :)
jeffa
L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
| [reply] [d/l] [select] |
|
There will always be some bizarre cases that don't fit a simple schema, but this is a good start. I also don't like tables which require two words for their name: I prefer a schema that conforms to one table, one noun.
Here are some ideas to extend it:
I would propose the term 'player' for anyone or anything that could act out a performance. Thus s/people/players/. Even inanimate objects and other non-people may be players; think "Cujo" or "Otto Pilot as himself" for easy examples.
I would propose a table 'episodes' to capture multiple titles that are intended to be seen as continuations or illuminations on a single world. Thus, Star Wars has six canonical episodes, while Star Trek has a much larger number in its canon.
There are many cases where a single character is portrayed by many different players, either as different versions of the character ("young Indiana Jones") or different components of the character ("voice of Tweaky"). I propose to refer to these, if you want to track them separately, as 'aspects' of a single character. The simplest and most common is when a single player plays 'all' aspects throughout the episode.
A single player may perform more than one aspect and more than one character, too. Michael J Fox often plays at least two aspects of Marty McFly at once, as well as his daughter, his son, and his multiply-great-grandfather.
I would propose that a 'role' is the conjunction of 'player' vs 'character' vs 'aspect' vs 'episode'. So s/people_characters/roles/. James Earl Jones was the voice of Darth Vader in Star Wars through episodes SW4, SW5 and SW6. That's three roles by my count.
For more ideas, browse the IMDb.com site for a while and see if you can reverse engineer their tables.
-- [ e d @ h a l l e y . c c ]
| [reply] |
|
Re: Normalisation Question (OT but trust me, perl implementation)
by Bagration (Initiate) on Aug 04, 2003 at 20:17 UTC
|
I had a similar Problem. I have a collection of Video tapes:
Movies, Star Trek episodes and so on:
There are, relevant for your problem, the tables:
PERSONS, PERSON_TITLE, TITLES, CONTRIBUTORS
Sometimes you have persons as actors in one episode and as
director in another.
Examples:
PERSONS
+------+------------+---------------+
| prNR | prLASTNAME | prNAME |
+------+------------+---------------+
| 76 | McNeill | Robert Duncan |
+------+------------+---------------+
PERSON_TITLE
+-------+------+------+------+
| PT_NR | prNR | ttNR | cnNR |
+-------+------+------+------+
| 542 | 76 | 242 | 2 |
+-------+------+------+------+
cnNR stands for the number in CONTRIBUTORS.
This table looks as follows:
+------+-----------------+
| cnNR | cnFUNCTIONALITY |
+------+-----------------+
| 1 | Actor |
| 2 | Director |
Table Titles
+------+----------------------+------+
| ttNR | ttTITLE | dtNR |
+------+----------------------+------+
| 242 | Voyager - Das Ritual | 30 |
+------+----------------------+------+
(dtNR gives just the number of the Datacarrier and doesn't
bother here).
In this episode Duncan McNeill is actor and direktor as well.
So you can guess how the PERSON_TITLE set might change:
+-------+------+------+------+
| PT_NR | prNR | ttNR | cnNR |
+-------+------+------+------+
| 527 | 76 | 242 | 1 |
+-------+------+------+------+
Your solution: you only need a further Field in your
actors_role table.
I hope this was a bit helpful. Sorry, having watched Star Trek in
Germany, I have only the German episode names. | [reply] |
|
|