Generally, it is unwise (at best) to ignore a CPAN module that already exists and does what you want. That being said... if you're writing a web spider, why not save yourself some time and use LWP::RobotUA along with HTML::TokeParser or HTML::Parser (my personal favorite), as other monks have suggested?
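For illustration, here's a minimal, untested sketch of fetching a page politely with LWP::RobotUA (the agent name, e-mail address, and URL are just placeholders for the example):

use LWP::RobotUA;

# RobotUA honors robots.txt and throttles requests per host
my $ua = LWP::RobotUA->new('my-spider/0.1', 'me@example.com');
$ua->delay(1/6);    # wait at least 10 seconds between hits to the same host

my $response = $ua->get('http://www.example.com/');
if ($response->is_success) {
    my $html = $response->content;   # hand this to your parser of choice
}
else {
    warn "Fetch failed: ", $response->status_line, "\n";
}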
Doing this task (extracting links to follow and analyzing the <META> tags) with HTML::Parser is a matter of a few lines, and LWP already lets you fetch the HTML. According to recipe 20.3 in the Perl Cookbook, you could also use HTML::LinkExtor to extract the links, as this code (copied verbatim) shows:
use HTML::LinkExtor;
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse_file($filename);
@links = $parser->links;
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;   # element type
    # possibly test whether this is an element we're interested in
    while (@element) {
        # extract the next attribute and its value
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        # ... do something with them ...
    }
}
However, in every case where I've needed to do this, I've also needed to parse the HTML anyway, so I've always used HTML::Parser for that too.
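As a rough, untested sketch of what I mean (the %meta and @links names are made up for the example, and $html is assumed to hold the page source fetched with LWP), HTML::Parser's version-3 event API lets you grab both the <META> tags and the links in one pass:

use HTML::Parser;

my %meta;     # name => content from <META name="..." content="..."> tags
my @links;    # href/src values found in the page

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [ sub {
        my ($tag, $attr) = @_;
        if ($tag eq 'meta' && defined $attr->{name}) {
            $meta{ lc $attr->{name} } = $attr->{content};
        }
        push @links, $attr->{href} if $tag eq 'a'   && defined $attr->{href};
        push @links, $attr->{src}  if $tag eq 'img' && defined $attr->{src};
    }, 'tagname, attr' ],
);
$p->parse($html);
$p->eof;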
Hope this all helps a bit. Feel free to ask further questions.
Best regards
-lem, but some call me fokat