princepawn has asked for the wisdom of the Perl Monks concerning the following question:

I was a bit out of practice grokking HTML, so I decided to automate downloading the Game of the Day from chessgames.com. Here is the HTML::TreeBuilder code that does the job. I have heard that there are more popular modules for parsing HTML, but since I can use the same API for XML and HTML, I am sticking with TreeBuilder unless someone shows me how easy it would have been with something else:
sub game_of_day {

    my $outfile = shift or die "must supply a file to dump game to";

    # retrieve http://www.chessgames.com
    my $html = get $home;

    # parse the page; eof tells the parser the input is complete
    $tb->parse($html);
    $tb->eof;

    my $god; # god == Game of the Day

    # make it so that text nodes are changed into nodes with tags
    # just like any other HTML aspect.
    # then they can be searched with look_down
    $tb->objectify_text;

    # Find the place in the HTML where Game of the Day is
    my $G = $tb->look_down
      (
       '_tag' => '~text',
       text   => 'Game of the Day'
      );

    # warn $G->as_HTML;

    # find _all_ tr in the lineage of the found node... I don't know a
    # way to limit the search
    my @U = $G->look_up
      (
       '_tag' => 'tr',
      );

    # by inspecting the output of $tree->dump, I saw that certain parts of the
    # tree had certain absolute addresses from the root of the tree.
    # I had planned a neat API allowing one to access various aspects of the
    # Game of the Day, but for now, I just want the chessgame!
    my %address =
      (
       'date'         => '0.1.2.0.0.0.0.0.0.0.0.0.0.0.2.0',
       'game_url'     => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.1',
       'white_player' => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.1.0',
       'black_player' => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.1.4',
       'game_title'   => '0.1.2.0.0.0.0.0.0.0.0.1.0.0.0.3.0',
      );

    # debugging output
    while ( my ($k, $v) = each %address ) {
        warn " ** $k ** ", $/, $tb->address($v)->as_HTML, $/
    }

    # let's get the URL of the game
    my $game_url = $tb->address($address{game_url})->attr('href');
    my ($game_id) = $game_url =~ m/(\d+)/;

    # let's get the game, faking out the web spider filter in the process:
    my $pgn = _get "http://www.chessgames.com/perl/nph-chesspgndownload?gid=$game_id";

    # let's save it to disk
    open my $fh, '>', $outfile or die "error opening $outfile for writing: $!";
    print {$fh} $pgn;
    close($fh)
}
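
On the "I don't know a way to limit the search" comment: a minimal sketch, assuming the documented HTML::Element behaviour that look_up and look_down return only the first match when called in scalar context. The markup, the gid value, and the variable names here are invented for illustration; this is not the real chessgames.com page. Searching relative to the "Game of the Day" label like this might also make the hard-coded absolute addresses unnecessary.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the real page: nested tables, a label, a link.
my $html = <<'HTML';
<html><body>
<table><tr><td>
  <table>
    <tr><td><b>Game of the Day</b></td></tr>
    <tr><td><a href="/perl/chessgame?gid=1234567">White vs Black</a></td></tr>
  </table>
</td></tr></table>
</body></html>
HTML

my $tb = HTML::TreeBuilder->new_from_content($html);

# Find the element whose text is exactly the label we care about.
my $label = $tb->look_down(
    _tag => 'b',
    sub { $_[0]->as_text eq 'Game of the Day' },
);

# Scalar context: only the nearest enclosing <table> comes back,
# rather than every matching ancestor.
my $table = $label->look_up( _tag => 'table' );

# Search downward from that table only, for a link carrying a game id.
my $link = $table->look_down( _tag => 'a', href => qr/gid=\d+/ );
my ($game_id) = $link->attr('href') =~ /gid=(\d+)/;

print "game id: $game_id\n";

$tb->delete;    # free the tree explicitly
```

If the page layout shifts, only the label text and the href pattern would need adjusting, whereas an address such as '0.1.2.0...' breaks as soon as one wrapper element is added.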

Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality.

Replies are listed 'Best First'.
Re: How do you scan HTML?
by jeffa (Bishop) on Sep 19, 2003 at 16:23 UTC

    If it works and you are used to it, then I think you might as well stick with it. I am used to HTML::TokeParser::Simple, which is what I used below.

    One problem with your script: it will not get past this site's spider blocker. You have to accept and store the cookie that the site hands you so that you can present it back when you download the PGN file. You also have to fake your user agent string to that of a well-known browser. Here is a version that should do what you want.

    use strict;
    use warnings;
    use Data::Dumper;
    use LWP;
    use HTTP::Cookies;
    use HTTP::Request::Common;
    use HTML::TokeParser::Simple;

    my $save = shift or die "USAGE $0 [savefile]\n";

    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101');
    $ua->cookie_jar({
        file     => $ENV{HOME} . '/.cookies.txt',
        autosave => 1
    });

    my $request  = GET('http://www.chessgames.com');
    my $response = $ua->request($request);

    my %game;
    my $parser = HTML::TokeParser::Simple->new(\$response->content);

    while (my $token = $parser->get_token) {
        next unless $token->is_comment;
        if ($token->as_is =~ /begintoday/) {
            $token = $parser->get_token;
            $game{date} = $token->as_is;
        }

        next unless $token->is_comment;
        if ($token->as_is =~ /begingameotd/) {
            $token = $parser->get_token;
            ($game{gid}) = $token->return_attr->{href} =~ /(\d+)$/;
            $token = $parser->get_token;
            $game{black} = $token->as_is;
            $token = $parser->get_token for 1..7; # evil, but works
            ($game{white}) = $token->as_is =~ /(\w+)/;
            $token = $parser->get_token for 1..4; # ditto
            $game{title} = $token->as_is;
        }
    }

    print Dumper \%game;

    $ua->mirror("http://www.chessgames.com/perl/nph-chesspgndownload?gid=$game{gid}",$save);

    Regardless of whether you decide to switch Parser modules, you will no doubt appreciate the part that handles cookies and the user agent string. ;)

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      No question about the usefulness of the cookie stuff, thanks... I just remembered that most people prefer XML::LibXML for this sort of thing... I would have to learn XPath to use it, though.
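
For a rough idea of what the XPath route could look like, here is a minimal sketch, assuming a version of XML::LibXML that provides load_html. The markup and the gid value are invented for illustration and do not come from the real page.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the fetched page.
my $html = <<'HTML';
<html><body>
<table>
  <tr><td><b>Game of the Day</b></td></tr>
  <tr><td><a href="/perl/chessgame?gid=1234567">White vs Black</a></td></tr>
</table>
</body></html>
HTML

# recover => 2 asks libxml2 to parse forgivingly and without warnings,
# which real-world HTML usually needs.
my $doc = XML::LibXML->load_html( string => $html, recover => 2 );

# XPath: the href attribute of any <a> whose href contains "gid=".
my $href = $doc->findvalue('//a[contains(@href, "gid=")]/@href');
my ($game_id) = $href =~ /gid=(\d+)/;

print "game id: $game_id\n";
```

The single XPath expression does the job of a look_down call plus an attr call, so the learning curve may be smaller than it looks.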

Re: How do you scan HTML?
by PodMaster (Abbot) on Sep 19, 2003 at 10:25 UTC
    If you can live with the memory footprint, stick with it. I would personally use HTML::TokeParser::Simple (or HTML::Parser if it were called for). It too has an XML counterpart (XML::TokeParser). I do not care to provide an example (there are tons already --> super search -- and lots of them are in "how do you scan html" type threads).

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.