Cody Fendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to create a robust way of detecting which season and episode a TV show is from.

Common patterns are things like "S5E6", "S5.E6", or "S05xE06" but there's no real consistency.

Also I want to fall back to detecting a date for shows like The Daily Show, which don't really have seasons/episodes.

So here's my first attempt, please comment. Also if you're interested, please test on bulk data from TV torrent websites, which is what I'm doing right now.

One straightforward question: what's the best-practice way to remove leading zeros? Convert to a number with something like $foo = ($foo + 0)? Strip the zeros as characters? sprintf?

#!/usr/local/bin/perl
use strict;
use warnings;
use Data::Dumper::Simple;
use Regexp::Common qw(time);

my ( $count, $success, $failure ) = ( 0, 0, 0 );

while (<DATA>) {
    chomp;
    $count++;
    my $data = extract_show_info($_);
    if ($data) {
        print "Success? \n";
        print Dumper($data);
        $success++;
    }
    else {
        print "Failure: $_\n";
        $failure++;
    }
}

print "Processed: $count | Successes: $success | Failures: $failure \n";

# Try season/episode first, then fall back to a year-month-day date.
sub extract_show_info {
    my $input_string = shift();
    my $result       = undef;
    if ( $result = extract_episode_data($input_string) ) {
        $result->{type} = 'se';
    }
    # Match against the argument rather than the global $_.
    elsif ( my @date = $input_string =~ /$RE{time}{ymd}{-keep}/ ) {
        $result = {
            type  => 'date',
            year  => $date[1],
            month => $date[2],
            day   => $date[3]
        };
    }
    return $result;
}

# Returns { season, episode } for the first pattern that matches, else nothing.
sub extract_episode_data {
    my $input_string = shift();
    if (   $input_string =~ /s(\d+)\s*e(\d+)/i
        || $input_string =~ /s(\d+)\.e(\d+)/i
        || $input_string =~ /(\d+)x(\d+)/i
        || $input_string =~ /Season\s*(\d+),?\s*Episode\s*(\d+)/i
        || $input_string =~ /Series\.(\d+)\.(\d+)/ )
    {
        my $episode_data = { season => $1, episode => $2 };
        return $episode_data;
    }
    else {
        return;
    }
}

__DATA__
The.Walking.Dead.S01E03.FRENCH.LD.BDRip.XviD-JMT.avi 348.55 Mb
Gogglebox.AU.s01e08.PDTV.x264.Hector.mp4 266.46 Mb
Power S03E01 HDTV x264-FS.mp4 285.38 Mb
Wentworth.s03e04.HDTV.x264.Hector.mp4 226.32 Mb
Suits.S06E03.HDTV.x264-FUM[eztv].mp4 222.37 Mb
Killjoys.S02E07.HDTV.x264-FUM[eztv].mp4 255.05 Mb
Superfoods.The.Real.Story.Series.2.4of8.Seaweed.720p.HDTV.x264.AACmp4[eztv].mp4 439.43 Mb
Keeping.Up.With.The.Kardashians.S12E01.Out.With.The.Old.In.With.The.New.HDTV-MEGATV.mp4 445.27 Mb
Keeping.Up.With.The.Kardashians.S12E04.All.About.Meme.HDTV-MEGATV.mp4 416.17 Mb
Are You the One S04E08 HDTV x264-Nada.mp4 476.85 Mb
Superfoods.The.Real.Story.Series.2.8of8.Avocados.720p.HDTV.x264.AAC.MVGroup.org.mp4 430.01 Mb
Kingdom 2014 S02E20 No Sharp Objects HDTV x264-TTL.mp4 457.65 Mb
The.Big.Bang.Theory.S09E19.HDTV.x264-LOL[eztv].mp4 144.79 Mb
Superfoods.The.Real.Story.Series.2.4of8.Seaweed.720p.HDTV.x264.AAC.MVGroup.org.mp4 439.43 Mb
[www.Cpasbien.pe] Vikings.S02E07.FRENCH.HDTV.x264-DEAL.mp4 309.86 Mb
BBC.Inside.Einsteins.Mind.1080p.HDTV.x265.AAC.MVGroup.Forum.mp4 728.65 Mb

Replies are listed 'Best First'.
Re: Please review this: code to extract the season/episode or date from a TV show's title on a torrent site
by Anonymous Monk on Aug 18, 2016 at 07:39 UTC

    About 0-stripping: if you are going to use the value as a number, I would go with + 0; otherwise s/^0+//. (Perl, as you know, converts the string to a number when needed.)
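
    A minimal sketch of the three options side by side, for comparison. The helper names are hypothetical, and the regex variant adds a lookahead (not part of the suggestion above) so that an all-zero field such as "00" becomes "0" rather than an empty string.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per approach mentioned in the question.
sub strip_numeric {    # numeric coercion: "05" + 0 is 5
    my ($s) = @_;
    return $s + 0;
}

sub strip_sprintf {    # explicit integer formatting
    my ($s) = @_;
    return sprintf '%d', $s;
}

sub strip_regex {      # stays a string; the lookahead keeps the last digit
    my ($s) = @_;
    ( my $t = $s ) =~ s/^0+(?=\d)//;
    return $t;
}

for my $raw (qw(05 12 007 0 00)) {
    printf "%-4s => %s | %s | %s\n", $raw,
        strip_numeric($raw), strip_sprintf($raw), strip_regex($raw);
}
```

    For fields captured with (\d+) all three agree; they only differ once the input can contain something other than digits, where + 0 and sprintf warn and the substitution leaves the text alone.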

Re: Please review this: code to extract the season/episode or date from a TV show's title on a torrent site
by Anonymous Monk on Aug 18, 2016 at 08:09 UTC

    If you are going to return a hash reference from extract_episode_data() ...

    sub extract_show_info {
        my $input_string = shift();
        my $result = undef;
        if ( $result = extract_episode_data($input_string) ) {
            $result->{type} = 'se';
        }
        elsif ( my @date = $input_string =~ /$RE{time}{ymd}{-keep}/ ) {
            $result = { ... };
        }
        return $result;
    }

    sub extract_episode_data {
        my $input_string = shift();
        if ( ... ) {
            my $episode_data = { season => $1, episode => $2 };
            return $episode_data;
        }
        else {
            return;
        }
    }

    ... why not set the type in there too? That would lead to something like ...

    sub extract_show_info {
        my $input_string = shift @_;
        my $result = extract_episode_data($input_string);
        $result and return $result;
        if ( my @date = $input_string =~ /$RE{time}{ymd}{-keep}/ ) {
            return { ... };
        }
        return;
    }

    sub extract_episode_data {
        my $input_string = shift @_;
        if ( ... ) {
            return { type => 'se', season => $1, episode => $2 };
        }
        return;
    }

      ... why not set the type in there too?

      Makes sense, but I was trying to keep the two completely separate, de-coupled or whatever the right word is. Then I can re-use the season-episode sub cleanly for something else? Maybe I'm over-thinking.

Re: Please review this: code to extract the season/episode or date from a TV show's title on a torrent site
by Anonymous Monk on Aug 18, 2016 at 08:39 UTC

    Note to self: Regexp::Common::time provides the time regex, not Regexp::Common.

    You would be lucky if year-month-day were the only date format you ever met, rather than the other two orderings as well. So I take it that the files not matching your season-episode regex carry the date only in that format?

      That's a really tricky question.

      I don't see many other date formats, and there's really no way, in code at least, to deal with the possibility that someone has got the month and day the wrong way round and their August 1 is really January 8.

        You could look at consecutively-numbered episodes and see if they are 1 week (or whatever) apart. Or at least that each later-numbered episode has a later date.
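
        A rough sketch of the "later episode, later date" check. It assumes each record carries season, episode, year, month and day together, which the code in the question does not produce (it returns either season/episode or a date), and date_order_breaks is a hypothetical name. It only reports where the ordering breaks, not which of the two neighbours has the wrong date.

```perl
use strict;
use warnings;

# Hypothetical helper: takes hashrefs with season, episode, year, month and
# day keys and returns each record whose date is not later than the date of
# the episode before it.
sub date_order_breaks {
    my @records = sort {
        $a->{season} <=> $b->{season} || $a->{episode} <=> $b->{episode}
    } @_;

    my ( @breaks, $previous );
    for my $record (@records) {
        # Zero-padded YYYY-MM-DD strings order correctly under string comparison.
        my $stamp = sprintf '%04d-%02d-%02d', @{$record}{qw(year month day)};
        push @breaks, $record if defined $previous && $stamp le $previous;
        $previous = $stamp;
    }
    return @breaks;
}

# Made-up data: episode 2 may really be 8 January with day and month swapped.
my @episodes = (
    { season => 1, episode => 1, year => 2016, month => 1, day => 1 },
    { season => 1, episode => 2, year => 2016, month => 8, day => 1 },
    { season => 1, episode => 3, year => 2016, month => 1, day => 15 },
);

printf "Check S%02dE%02d and its predecessor\n", @{$_}{qw(season episode)}
    for date_order_breaks(@episodes);
```

        Checking for a fixed gap such as one week would need real date arithmetic (Time::Piece, for instance) rather than string comparison.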

        Yup ... may need to account for idiosyncrasies per provider, say by assigning a different regex/parser.
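
        One way that per-provider idea could look, as a sketch only. The provider tags are taken from the sample data in the question, but the rules attached to them are guesses, and @parsers and parse_by_provider are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical parser table, tried in order. An entry applies when its
# "match" regex recognises the title; if its parser finds nothing, the
# next entry gets a turn.
my @parsers = (
    {
        name  => 'mvgroup',
        match => qr/MVGroup/i,
        parse => sub {
            my ($title) = @_;
            return unless $title =~ /Series\.(\d+)\.(\d+)of\d+/i;
            return { season => $1 + 0, episode => $2 + 0 };
        },
    },
    {
        name  => 'default',
        match => qr/./,
        parse => sub {
            my ($title) = @_;
            return unless $title =~ /s(\d+)[\s.]*e(\d+)/i;
            return { season => $1 + 0, episode => $2 + 0 };
        },
    },
);

sub parse_by_provider {
    my ($title) = @_;
    for my $parser (@parsers) {
        next unless $title =~ $parser->{match};
        my $result = $parser->{parse}->($title) or next;
        return { %$result, provider => $parser->{name} };
    }
    return;
}

my $info = parse_by_provider(
    'Superfoods.The.Real.Story.Series.2.8of8.Avocados.720p.HDTV.x264.AAC.MVGroup.org.mp4'
);
print "$info->{provider}: season $info->{season}, episode $info->{episode}\n"
    if $info;
```

        Keeping the rules in a table means a new provider quirk is one more entry rather than another branch in a growing if/elsif chain.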