If it works and you are used to it, then I think you might as well stick with it. I am used to HTML::TokeParser::Simple, which is what I used below.

One problem with your script: it will not get past this site's spider blocker. You have to accept and store the cookie that the site hands you so that you can present it back when you download the PGN file. You also have to set your user agent string to that of a well-known browser. Here is a version that should do what you want.

use strict;
use warnings;
use Data::Dumper;

use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common;
use HTML::TokeParser::Simple;

my $save = shift or die "USAGE $0 [savefile]\n";

my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101');
$ua->cookie_jar({
   file     => $ENV{HOME} . '/.cookies.txt',
   autosave => 1
});

my $request = GET('http://www.chessgames.com');
my $response = $ua->request($request);
die "Could not fetch front page: ", $response->status_line, "\n"
   unless $response->is_success;

my %game;
my $parser = HTML::TokeParser::Simple->new(\$response->content);

while (my $token = $parser->get_token) {
   next unless $token->is_comment;
   if ($token->as_is =~ /begintoday/) {
      $token = $parser->get_token;
      $game{date} = $token->as_is;
   }

   next unless $token->is_comment;
   if ($token->as_is =~ /begingameotd/) {
      $token = $parser->get_token;
      ($game{gid}) = $token->return_attr->{href} =~ /(\d+)$/;
      $token = $parser->get_token;
      $game{black} = $token->as_is;
      $token = $parser->get_token for 1..7;      # evil, but works
      ($game{white}) = $token->as_is =~ /(\w+)/;
      $token = $parser->get_token for 1..4;      # ditto
      $game{title} = $token->as_is;
   }
}

print Dumper \%game;
$ua->mirror("http://www.chessgames.com/perl/nph-chesspgndownload?gid=$game{gid}", $save);

Regardless of whether you decide to switch Parser modules, you will no doubt appreciate the part that handles cookies and the user agent string. ;)
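
If you want to lift just that part into another script, here is a minimal sketch of the cookie and user agent handling on its own. The helper names make_browser_like_ua and fetch_or_die are made up for illustration, the agent string is only a placeholder, and the cookie jar lives in memory instead of in a file.

```perl
use strict;
use warnings;

use LWP::UserAgent;
use HTTP::Cookies;

# Hypothetical helper: build a user agent that accepts cookies and
# presents itself with a browser-like agent string.
sub make_browser_like_ua {
    my ($agent) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->agent($agent);
    $ua->cookie_jar(HTTP::Cookies->new);   # in-memory jar, nothing saved to disk
    return $ua;
}

# Hypothetical helper: fetch a URL and die with the HTTP status line
# if the request did not succeed.
sub fetch_or_die {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    die "GET $url failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Placeholder agent string, an assumption; use whatever the site accepts.
my $ua = make_browser_like_ua('Mozilla/5.0 (compatible; example)');
```

Calling fetch_or_die($ua, 'http://www.chessgames.com') would then pick up the site's cookie, and any later request made with the same $ua presents it back. Note that mirror() also returns an HTTP::Response, so its result can be checked with is_error in the same way.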

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

In reply to Re: How do you scan HTML? by jeffa
in thread How do you scan HTML? by princepawn
