Re: Repost of regex

Generally, it is unwise (at best) not to use a CPAN module when one exists and does what you want. That being said, if you're trying to write a web spider, why not save yourself some time and use LWP::RobotUA along with HTML::TokeParser or HTML::Parser (my personal favorite), as other monks suggested?

Writing this task (extracting the links to follow and analyzing the <META> tags) with HTML::Parser is a matter of a few lines, and LWP already fetches the HTML for you. According to recipe 20.3 in the Perl Cookbook, you could also use HTML::LinkExtor to extract the links, as this code shows (copied verbatim):

use HTML::LinkExtor;

$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse_file($filename);
@links = $parser->links;
foreach $linkarray (@links) {
    my @element = @$linkarray;
    my $elt_type = shift @element;                  # element type

    # possibly test whether this is an element we're interested in
    while (@element) {
        # extract the next attribute and its value
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        # ... do something with them ...
    }
}

However, whenever I've needed to do this, I've also needed to parse the HTML itself, so I've always used HTML::Parser for that too.
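For what it's worth, here is a rough sketch of how that might look with the version 3 API of HTML::Parser. It is only an illustration, not the Cookbook's code: the helper name extract_links_and_meta and the sample HTML are made up for this example, and it only looks at <A HREF> and <META NAME> tags, so a real spider would probably want frames, images and so on as well.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper (not from the Cookbook): takes an HTML string and
# returns (\@links, \%meta), where %meta maps lowercased META names to
# their CONTENT values.
sub extract_links_and_meta {
    my ($html) = @_;
    my (@links, %meta);

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for every start tag; tag and attribute names arrive
        # lowercased by default.
        start_h => [
            sub {
                my ($tag, $attr) = @_;
                if ($tag eq 'a' && defined $attr->{href}) {
                    push @links, $attr->{href};
                }
                elsif ($tag eq 'meta' && defined $attr->{name}) {
                    $meta{ lc $attr->{name} } = $attr->{content};
                }
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;    # signal end of document so buffered text is flushed

    return (\@links, \%meta);
}

# Made-up sample document, just to exercise the helper.
my $html = <<'HTML';
<html><head><meta name="robots" content="noindex,follow"></head>
<body><a href="/a.html">A</a> <A HREF="http://example.com/b">B</A></body></html>
HTML

my ($links, $meta) = extract_links_and_meta($html);
print "link: $_\n" for @$links;
print "robots: $meta->{robots}\n" if exists $meta->{robots};
```

The links come back exactly as written in the page, so relative ones would still need to be resolved against the base URL (the URI module's new_abs can do that) before you follow them.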

Hope this all helps a bit. Feel free to ask further questions.

Best regards

-lem, but some call me fokat
