Re: How do you scan HTML?

If it works, and you are used to it, then i think you might as well stick with it. I am used to HTML::TokeParser::Simple, which is what i used below.

One problem with your script: it will not get past this site's spider blocker. You have to accept and store the cookie that the site hands you so that you can present it back when you download the PGN file. You also have to set your user agent string to that of a well-known browser. Here is a version that should do what you want.

use strict;
use warnings;
use Data::Dumper;

use LWP;
use HTTP::Cookies;
use HTTP::Request::Common;
use HTML::TokeParser::Simple;

my $save = shift or die "USAGE $0 [savefile]\n";

my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101');
$ua->cookie_jar({
   file     => $ENV{HOME} . '/.cookies.txt',
   autosave => 1
});

my $request = GET('http://www.chessgames.com');
my $response = $ua->request($request);
die 'Request failed: ', $response->status_line, "\n" unless $response->is_success;

my %game;
my $parser = HTML::TokeParser::Simple->new(\$response->content);

while (my $token = $parser->get_token) {
   next unless $token->is_comment;
   if ($token->as_is =~ /begintoday/) {
      $token = $parser->get_token;
      $game{date} = $token->as_is;
   }

   next unless $token->is_comment;
   if ($token->as_is =~ /begingameotd/) {
      $token = $parser->get_token;
      ($game{gid}) = $token->return_attr->{href} =~ /(\d+)$/;
      $token = $parser->get_token;
      $game{black} = $token->as_is;
      $token = $parser->get_token for 1..7;      # evil, but works
      ($game{white}) = $token->as_is =~ /(\w+)/;
      $token = $parser->get_token for 1..4;      # ditto
      $game{title} = $token->as_is;
   }
}

print Dumper \%game;
$ua->mirror("http://www.chessgames.com/perl/nph-chesspgndownload?gid=$game{gid}", $save);

Regardless of whether you decide to switch Parser modules, you will no doubt appreciate the part that handles cookies and the user agent string. ;)
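
In case it helps to see what the cookie jar is doing for you, here is a minimal offline sketch (not part of the script above). It fakes a response that sets a cookie, lets HTTP::Cookies store it, and then shows the Cookie header that would go out with the next request. The host www.example.com, the path /download, and the cookie session=abc123 are made up for illustration; the real site's cookie will look different.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;
use HTTP::Response;

# An in-memory jar; pass file/autosave as in the script above to persist it.
my $jar = HTTP::Cookies->new;

# Pretend we fetched the front page and the site handed us a cookie.
# (Hypothetical host and cookie, for illustration only.)
my $first    = HTTP::Request->new(GET => 'http://www.example.com/');
my $response = HTTP::Response->new(200, 'OK');
$response->request($first);
$response->header('Set-Cookie' => 'session=abc123; path=/');

# Step 1: accept and store the cookie.
$jar->extract_cookies($response);

# Step 2: present it back on the next request to the same host.
my $second = HTTP::Request->new(GET => 'http://www.example.com/download');
$jar->add_cookie_header($second);

my $cookie_header = $second->header('Cookie');
print "Cookie header sent back: $cookie_header\n";
```

LWP::UserAgent runs these two steps (extract_cookies and add_cookie_header) for you on every request once a cookie_jar is set, which is why the script above never calls them directly.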

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)


Replies are listed 'Best First'.
Re: Re: How do you scan HTML? by princepawn (Parson) on Sep 22, 2003 at 15:54 UTC
No question about the usefulness of the cookie stuff, thanks... I just remembered that most people prefer XML::LibXML for this sort of thing... I would have to learn XPath to use it, though.

Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality.
3Re: How do you scan HTML? by jeffa (Bishop) on Sep 22, 2003 at 16:07 UTC
I actually considered using XML::LibXML and XML::XPath to solve this. Personally, i don't like my solution above very much ... the HTML doesn't seem to match very well with my HTML::TokeParser::Simple solution (but i do wonder how Ovid would have solved it ;)). I have a few nodes that demonstrate XPath:

XML::Generator::DBI Tutorial (example 2)
(jeffa) 3Re: Calling nested containers with XML::Simple
(jeffa) Re: Read XML, Create Dir if not exist
(jeffa) Re: CSV Zero Length??
(jeffa) Re: Writing Out XML using XML::Simple
Re: HTML parsing using RegEx, HTML::Parser and or HTML::TokeParser?

Also, i find zvon.org's XPath Tutorial to be quite excellent. XPath is worth learning. ;)

jeffa
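
For anyone else who has not learned XPath yet, here is a rough sketch of the XML::LibXML route. It runs against a made-up HTML snippet (the div id, the player names, and gid 1234567 are invented, not the real chessgames.com markup), so treat it as an illustration of the idea rather than a drop-in replacement for the script above.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup, loosely modelled on a "game of the day" box.
my $html = <<'HTML';
<html><body>
<div id="gotd">
  <a href="/perl/chessgame?gid=1234567">Example Player vs Another Player</a>
</div>
<a href="/about">About</a>
</body></html>
HTML

# load_html uses libxml2's forgiving HTML parser; recover mode keeps going
# on sloppy markup, and the suppress_* options keep its complaints quiet.
my $dom = XML::LibXML->load_html(
   string            => $html,
   recover           => 1,
   suppress_errors   => 1,
   suppress_warnings => 1,
);

# One XPath expression replaces the token counting: take the first link
# whose href contains the game-id query string.
my ($link) = $dom->findnodes('//a[contains(@href, "chessgame?gid=")]');
die "no game link found\n" unless $link;

my ($gid) = $link->getAttribute('href') =~ /gid=(\d+)/;
my $text  = $link->textContent;

print "gid:  $gid\n";
print "text: $text\n";
```

The appeal is that the expression describes what you want (a link with a certain href) instead of how many tokens away it is, so it should survive small layout changes better.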
Re: 3Re: How do you scan HTML? by Ovid (Cardinal) on Sep 22, 2003 at 19:43 UTC
jeffa wrote: but I do wonder how Ovid would have solved it.

Ovid replies:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my $browser = WWW::Mechanize->new();
$browser->get('http://www.chessgames.com');
$browser->follow_link( url_regex => qr/chessgame\?gid=\d+/ );
$browser->follow_link( text => 'view text' );
print $browser->content;

I think that's a wee bit easier to follow :)

Cheers,
Ovid

New address of my CGI Course.