I was a bit out of practice grokking HTML, so I decided to automate downloading the Game of the Day from chessgames.com. Here is the HTML::TreeBuilder code that does the job. I have heard that there are more popular modules for parsing HTML, but since I can use the same API for XML and HTML, I'm sticking with TreeBuilder unless someone shows me how much easier it would have been with something else:
sub game_of_day {

    my $outfile = shift or die "must supply directory to dump game to";

    # retrieve http://www.chessgames.com
    my $html = get $home;

    # parse the page
    $tb->parse($html);

    my $god; # god == Game of the Day

    # make it so that text nodes are changed into nodes with tags
    # just like any other HTML aspect.
    # then they can be searched with look_down
    $tb->objectify_text;

    # Find the place in the HTML where Game of the Day is
    my $G = $tb->look_down
      (
       '_tag' => '~text',
       text   => 'Game of the Day'
      );

    # warn $G->as_HTML;

    # find _all_ tr in the lineage of the found node... I don't know a
    # way to limit the search
    my @U = $G->look_up
      (
       '_tag' => 'tr',
      );

    # by inspecting the output of $tree->dump, I saw that certain parts of the
    # tree had certain absolute addresses from the root of the tree.
    # I had planned a neat API allowing one to access various aspects of the
    # Game of the Day, but for now, I just want the chessgame!
    my %address =
      (
       'date'         => '0.1.2.0.0.0.0.0.0.0.0.0.0.0.2.0',
       'game_url'     => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.1',
       'white_player' => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.1.0',
       'black_player' => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.1.4',
       'game_title'   => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.3.0',
      );

    # debugging output
    while ( my ($k, $v) = each %address ) {
        warn " ** $k ** ", $/, $tb->address($v)->as_HTML, $/
    }

    # lets get the URL of the game
    my $game_url = $tb->address($address{game_url})->attr('href');
    my ($game_id) = $game_url =~ m/(\d+)/;

    # let's get the game, faking out the web spider filter in the process:
    my $pgn = _get "http://www.chessgames.com/perl/nph-chesspgndownload?gid=$game_id";

    # let's save it to disk
    open F, ">$outfile" or die "error opening $outfile for writing: $!";
    print F $pgn;
    close(F)
}
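
On the "I don't know a way to limit the search" comment: as far as I can tell from the HTML::Element docs, look_up in scalar context returns only the nearest match, so the closest enclosing tr can be grabbed directly, and a relative look_down from that row might replace the hard-coded addresses. Here is a minimal sketch of that idea. The sample markup, the gid= link pattern, and the helper name game_id_from_html are assumptions made up for illustration; the real page layout may well differ.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup standing in for the real page (an assumption for illustration).
my $sample_html = <<'HTML';
<html><body><table>
<tr><td>Puzzle of the Day</td><td><a href="/perl/chessgame?gid=1111111">Some puzzle</a></td></tr>
<tr><td>Game of the Day</td><td><a href="/perl/chessgame?gid=1234567">Some game</a></td></tr>
</table></body></html>
HTML

# Hypothetical helper: returns the game id found in the row labelled
# "Game of the Day", or undef if the label or the link is missing.
sub game_id_from_html {
    my ($html) = @_;

    my $tb = HTML::TreeBuilder->new_from_content($html);

    # Turn text segments into '~text' pseudo-elements so look_down can find them.
    $tb->objectify_text;

    my $game_id;
    my $label = $tb->look_down( _tag => '~text', text => 'Game of the Day' );
    if ($label) {
        # Scalar context: look_up returns only the nearest enclosing <tr>.
        my $row = $label->look_up( _tag => 'tr' );

        # Search relative to that row instead of using an absolute address.
        my $link = $row ? $row->look_down( _tag => 'a', href => qr/gid=\d+/ ) : undef;

        ($game_id) = $link->attr('href') =~ /gid=(\d+)/ if $link;
    }

    $tb->delete;    # free the tree explicitly
    return $game_id;
}

my $id = game_id_from_html($sample_html);
print defined $id ? "game id: $id\n" : "no Game of the Day link found\n";
```

The trade-off is that this depends on the label text and the link pattern instead of the exact nesting depth, so a layout change that adds or removes a wrapper table should not break it, while a change to the label wording would.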

Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality.


In reply to How do you scan HTML? by princepawn
