
I was a bit out of practice grokking HTML, so I decided to automate downloading the Game of the Day from chessgames.com. Here is the HTML::TreeBuilder code that does the job. I had heard that there are more popular modules for parsing HTML, but since I can use the same API for XML and HTML, I'm sticking with TreeBuilder unless someone shows me how easy it would've been with something else:


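use strict;
use warnings;
use LWP::Simple;          # assumed source of get()
use HTML::TreeBuilder;

# Assumed setup, not shown in the original snippet: the URL is taken from
# the comment inside the sub, and $tb is a fresh parser object.
my $home = 'http://www.chessgames.com';
my $tb   = HTML::TreeBuilder->new;

sub game_of_day {

    my $outfile = shift or die "must supply an output file to dump the game to";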

    # retrieve http://www.chessgames.com

    my $html = get $home;

    # parse the page

    $tb->parse($html);

    my $god; # god == Game of the Day

    # make it so that text nodes are changed into nodes with tags
    # just like any other HTML aspect.
    # then they can be searched with look_down
    $tb->objectify_text;

    # Find the place in the HTML where Game of the Day is
    my $G = $tb->look_down
      (
       '_tag' => '~text',
       text   => 'Game of the Day'
      );

    # warn $G->as_HTML;

    # find _all_ tr in the lineage of the found node... I don't know a
    # way to limit the search
    my @U = $G->look_up
      (
       '_tag' => 'tr',
      );

    # by inspecting the output of $tree->dump, I saw that certain parts of the
    # tree had certain absolute addresses from the root of the tree.
    # I had planned a neat API allowing one to access various aspects of the
    # Game of the Day, but for now, I just want the chessgame!
    my %address = 
      (
       'date' => '0.1.2.0.0.0.0.0.0.0.0.0.0.0.2.0',
       'game_url' => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.1',
       'white_player' => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.1.0',
       'black_player' => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.1.4',
       'game_title'   => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.3.0',
      );

   
    # debugging output
    while ( my ($k, $v) = each %address ) {
        warn " ** $k ** ", $/, $tb->address($v)->as_HTML, $/;
    }

    # let's get the URL of the game
    my $game_url  = $tb->address($address{game_url})->attr('href');
    my ($game_id) = $game_url =~ m/(\d+)/;

    # let's get the game, faking out the web spider filter in the process
    # (_get is a helper that is not shown in this snippet):
    my $pgn       = _get("http://www.chessgames.com/perl/nph-chesspgndownload?gid=$game_id");

    # let's save it to disk
    open my $fh, '>', $outfile
      or die "error opening $outfile for writing: $!";
    print {$fh} $pgn;
    close $fh or die "error closing $outfile: $!";
    
}
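
The hard-coded addresses above will break as soon as the page layout shifts, so here is a rough sketch of one way to reach the game link without them. It relies on two documented HTML::Element behaviours: `look_up` and `look_down` return only the first match when called in scalar context, and `look_down` accepts a regex as an attribute value. The markup and the `find_game_id` name are invented for illustration. I have not checked this against the real chessgames.com page, so treat the `gid=` link pattern as an assumption.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented markup for illustration only; the real page will differ.
my $html = <<'HTML';
<html><body><table>
<tr><td>News</td><td><a href="/perl/chessgame?gid=111">Other game</a></td></tr>
<tr><td><b>Game of the Day</b></td>
    <td><a href="/perl/chessgame?gid=1234567">White vs Black</a></td></tr>
</table></body></html>
HTML

# Hypothetical helper: returns the game id found in the table row that
# holds the "Game of the Day" label, or undef if nothing matches.
sub find_game_id {
    my ($content) = @_;

    my $tree = HTML::TreeBuilder->new_from_content($content);
    $tree->objectify_text;    # text nodes become ~text pseudo-elements

    # Same starting point as the original code: the label's text node.
    my $label = $tree->look_down( _tag => '~text', text => 'Game of the Day' );

    # Scalar context: look_up returns only the nearest matching ancestor.
    my $row = $label ? $label->look_up( _tag => 'tr' ) : undef;

    # Search just that row for a link whose href carries a gid parameter.
    my $link = $row ? $row->look_down( _tag => 'a', href => qr/gid=\d+/ ) : undef;

    my ($game_id) = $link ? $link->attr('href') =~ /gid=(\d+)/ : ();

    $tree->delete;            # release the tree's memory
    return $game_id;
}

print find_game_id($html), "\n";
```

Anchoring on the label text and then walking to the enclosing row keeps the search local, so an unrelated link elsewhere on the page (the first row in the sample) is never considered.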

Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality.

In reply to How do you scan HTML? by princepawn
