Category: Audio
Author/Contact Info munk@munkboxen.mine.nu
Description: Fetches a list of albums from the CDDB website matching the search string entered on the command line.

An individual album can then be selected from this list so that the track listing for that album can be 'dumped' into a file in the current directory.

With additional arguments, the script will also vary the number of album titles per page to display.
Update: now substitutes HTML character entity references for ascii counterparts

#!/usr/bin/perl
my $progname = $0;
$progname =~ s,.*/,,;  # use basename only
my $version = "0.2";

use strict;
use LWP::UserAgent;

# these are configurable - they may change from time to time according
+ to
# CDDB.COM website file system structure:
my $base_url="http://www.gracenote.com";
my $search_uri="/php/search-adv.php3?q=";

# default items to show per page:
my $page_count=10;

# current result/album+artist item:
my $page_curr=1;

# for debugging:
my $debug;

# build query list from args:
my $query_list=join("+", @ARGV);

# if -h flag set or no args/search list, show usage:
if($query_list =~ /-[hH]/ || $query_list eq "") { usage(); }

if($query_list =~ s/\+?-n\+(\d+)\+?//){
    # a number to show per page is given
    # if number / page is > 50, show usage (max per page from cddb.com
+ is 50):
    if($1 > 50){ usage();}

    $page_count=$1;
}

# debug mode?:
if( $query_list =~ s/\+?-d\+?// ){
    $debug=1;
}

# build query url:
my $query_url=$base_url.$search_uri.$query_list."&f=all&s=$page_curr&n
+=$page_count";

print "Query URL: $query_url\n" if $debug;

# start off with first url:
main($query_url);

# this sub is called recursively, once for each 'page' of results ($pa
+ge_curr to ($page_curr + $page_count)):
sub main(){
    # run the query on the query url:
    my $result=get_url(shift);

    if($result->is_success){
        # we got a result, parse it:
        my @result_lines = split("\n", $result->content);

        # strip out the album/artist pairs from the results page:
        my (@album_url) = get_album_url(@result_lines);

        # display results for user to choose an album:
        &choose_album(@album_url) ;
    } else {
        die("
Error retrieving $query_url.

Check and compare the base search URL, \$base_url (=$base_url),
and the search URI, \$search_uri (=$search_uri),
in the code against the currently working url/uri at gracenote.com\n\n
+");
    }
}

sub get_url(){
    my $url = shift;

    # create user agent object:
    my $cddb_ua = new LWP::UserAgent;
    $cddb_ua->agent("$progname/0.1 ");

    # build the request object:
    my $cddb_req = new HTTP::Request GET => $url;

    # make the request:
    return my $cddb_res = $cddb_ua->request($cddb_req);
}


# sub returns a hash of url -> artists / album names:
sub get_album_url(){
    my ($list_started, $list_ended, @result_list);

    foreach (@_){
        # does this line tell us what page we're looking at
        # ie: <p><font size=2 face="Arial,Helvetica">
        # Displaying disc 1-10 of 2542 matching CDs</font></p>
        (/<p>.*?(Displaying disc .*? of .*? matching CDs).*?<\/p>/)&& 
+(push @result_list, $1);

        # is this start of list?:
        (/<!-- LIST START -->/) && ($list_started = 1) && (next);
        (/<!-- LIST END -->/) && ($list_ended);

        # save this list item into array:
        if($list_started && !$list_ended){
            # a list item looks like this:
            # <LI type=circle><FONT face="Arial, Helvetica, sans-serif
+"><B><!-- START ITEM --><!-- REL 100 ENDREL --><A HREF="/xm/pcd/genhi
+phop/e004abd1bd74777c1ad8ec4088ec67c7.html" >The Beastie Boys / Hello
+ Nasty</A><!-- END ITEM -->         </B></FONT> <br>&nbsp;&nbsp;&nbsp
+;Just A <b>Test</b><BR>
            # strip out urls / album title/artist:
            /A HREF="(.*)" >(.*)<\/A>/;
            my $tmp="$1##$2";            # me being stupid and forgett
+ing how to use hashes ;)
            push(@result_list, $tmp);
        }
    }
    # make sure the first item in @result_list is the 'Displaying disc
+ x of n matching CDs
    ($result_list[0] =~ /^Displaying disc/) || die("Unable to retrieve
+ paging info\n");

    return @result_list;
}


sub choose_album(){
    my $page_info=shift;
    my @album_url=@_;

    my $last_page;

    # print paging info:
    print $page_info,"\n";

    for(my $i=0; $i < $page_count; $i++){
        my (undef, $album) = split "##", $album_url[$i];
        printf("%2s. %s\n", $i+1, $album);
    }

    if(scalar(@album_url) < $page_count ){
        $page_count=@album_url;
        $last_page=1;
    }

    print "Select album (0, ..., $page_count)\n";
    print "'q' to quit\n";
    $last_page ? "" : print "Any other key for more...\n";

    while(<STDIN>){
        chomp;
        if(/^(\d+)$/){
            get_track_listing($album_url[$1-1]);
            exit;
        } elsif(/[qQ]/) {
            exit;
        } else {
            # increment current item by $page_count:
            $page_curr=$page_curr+$page_count;
            my $query_url=$base_url.$search_uri.$query_list."&f=all&s=
+$page_curr&n=$page_count";
            # add $page_count onto $page_curr in URL

            # works ok but we want $page_curr globally accessible
            #$query_url=~s/&s=(\d+)&/"&s=".int($1+$page_count)."&"/e;


            $query_url=~s/&s=(\d+)&/&s=$page_curr&/;

            &main($query_url);
        }
    }
}

sub get_track_listing(){
    my ($uri, $album_artist)= split "##", shift;
    my $url=$base_url.$uri;
    my ($artist, $album) = split " / ", $album_artist;
    my $outfile = $album." - ".$artist.".txt";

    # fetch the page containing the track list:
    my $result = &get_url($url);

    # open the output file for printing track list to:
    open(OUTFILE, ">$outfile") || die("Unable to open $outfile for wri
+ting\n");

    if($result->is_success){
        # we got the html page containing the track list ok,
        # parse out the track listing now.
        # track items look like this:
        # <LI><B><FONT face="Arial, Helvetica, sans-serif" >Super Disc
+o Breakin&#39;</FONT></B><br>
        my @result_lines=split("\n", $result->content);
        foreach(@result_lines){
            if( m#<LI><B><FONT.*? >(.*?)</FONT></B><br>#){
                my $track=$1;

                # replace any html char entity refs:
                $track =~ s/&#(\d+);/chr($1)/ge;

                print OUTFILE $track,"\n";
                print $track,"\n" if $debug;
            }
        }
    } else {
        die("Unable to retrieve $url\n");
    }
}

sub usage{
    die<<"EOT";
Usage: $progname [-h] [-d] [-n x] keyword1 ... keywordn

Search/query the cddb.com website for CD-ROM listings including
the search keywords keyword1 to keywordn.

Invoked with argument '-h' prints this help.

Invoked with argument '-d' prints debug info.

Invoked with argument '-n x' prints x number of results per page.
Max x == 50 (max number of 'hits' per page allowed by cddb.com).

EOT
}
1;
__END__

=head1 NAME

cddb_get_tracklist.pl - search for CD discs matching keywords entered 
+on command-line.

=head1 SYNOPSIS

cddb_get_tracklist.pl david holmes

Fetch a list of all albums listed on cddb.com containing the words
'david holmes' in.
Note this searches for occurences of 'david holmes'in any of album nam
+e, artist or track titles.

=head1 DESCRIPTION

Fetches a list of albums from the CDDB website matching the search str
+ing
entered on the command line.

An individual album can then be selected from this list so that the
track listing for that album can be 'dumped' into a file in the curren
+t directory.

With additional arguments, the script will also vary the number of alb
+um titles
per page to display.

=head1 README

Author:
Jez Hancock <jez.hancock@munkboxen.mine.nu>

Date:
20020622113210

Modules used:
LWP::UserAgent

Notes:

You may want to change the output file name format, I use
'album_title - artist.txt', which is good for me, but a lot of ppl
don't like spaces in filenames... up to you...

The code isn't that hot, and no doubt there are untold bugs... feel fr
+ee
to modify the code as you like, please just mail me if you do make any
+ considerable
changes - nice to hear about offspring making it in the world ;)

The code is liable to 'break' at such time that the fine folk at http:
+//gracenote.com
decide to change the search URL/URI format.  This shouldn't be too har
+d to fix and should
just be a matter of finding out the new format and editing the strings
+ $base_url and $search_uri
accordingly below.

Wish list:
To have the numbering fixed when a user 'pages' from one screen of res
+ults to the next.
Presently, first page will show result items numbered: '1 ... 10', sec
+ond page will then
show items numbered: '1 ... 10' also.  This works ok, just an aestheti
+c thing ;)

This script is totally raw!  I only hacked it up because I couldn't fi
+nd it anywhere else
(to my surprise).  Hope others find it useful... if you do let me know
+!

Changes:
0.2:
Added code to substitute HTML character entity references to correspon
+ding ascii characters

Jez

=head1 USAGE

C<cddb_get_tracklist.pl [-h] [-d] [-n x] keyword1 ... keywordn>

Search/query the cddb.com website for CD-ROM listings including
the search keywords keyword1 to keywordn.

  Invoked with argument '-h' prints this help.

  Invoked with argument '-d' prints debug info.

  Invoked with argument '-n x' prints x number of results per page.
  max x == 50 (max number of 'hits' per page allowed by cddb.com).

=head1 PREREQUISITES

This script requires the C<LWP> module.

=head1 BUGS

HTML Character Entity References aren't translated into ascii equivale
+nts
(ie &amp; isn't translated into '@')

Minimal paging, could be tweaked.

=head1 SEE ALSO

C<http://search.cpan.org/doc/DSHULTZ/Net-CDDBScan-2.01/CDDBScan.pm>

Interesting looking PM I found only after authoring this hack.

=head1 OSNAMES

any

=head1 SCRIPT CATEGORIES

Audio/MP3

=head1 AUTHOR

Copyright 1998-2000, Jez Hancock <jez.hancock@munkboxen.mine.nu> All r
+ights reserved.

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
Address bug reports and comments to: jez.hancock@munkboxen.mine.nu

=cut
Replies are listed 'Best First'.
Re: CDDB track listing search script
by belg4mit (Prior) on Jun 28, 2002 at 02:42 UTC
    I recommend looking at the CPAN CDDB modules/ This avoids the need for screen-scraping, and allows for use of http://freedb.org (which some prefer for various reasons).

    --
    perl -pew "s/\b;([mnst])/'$1/g"