Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to make a spider, but I am not really good at the whole regex thing. I am trying to get the description from the meta tag without TokeParser. I tried this, but I still couldn't get the description:
$html = "<html><meta name=\"description\" content=\"sf\"></html>";
$html =~ /<meta.+?name\s*=\s*("|')description\1\.+?content\s*=\s*("|') +(.*?)\2/;
print $3;

Re: Repost of regex
by Abigail-II (Bishop) on Feb 02, 2003 at 23:35 UTC
    Is there a reason why you don't want to use HTML::TokeParser, or another CPAN module that does HTML parsing?

    I mean, it can't be because you can do better (otherwise, you wouldn't have failed).

    Abigail

Re: Repost of regex
by fokat (Deacon) on Feb 03, 2003 at 03:14 UTC

    Generally, it is unwise (at best) not to use a CPAN module when one exists and does what you want. That being said... If you're trying to write a web spider, why don't you save yourself some time and use LWP::RobotUA along with HTML::TokeParser or HTML::Parser (my personal favorite), as other monks suggested?

    Writing this task (extracting links to follow and analyzing the <META> tags) with HTML::Parser is a matter of a few lines. LWP already allows you to get the HTML. According to recipe 20.3 in The Perl Cookbook, you could also use HTML::LinkExtor to extract the links as this code shows (copied verbatim):

    use HTML::LinkExtor;

    $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse_file($filename);
    @links = $parser->links;
    foreach $linkarray (@links) {
        my @element  = @$linkarray;
        my $elt_type = shift @element;    # element type

        # possibly test whether this is an element we're interested in
        while (@element) {
            # extract the next attribute and its value
            my ($attr_name, $attr_value) = splice(@element, 0, 2);
            # ... do something with them ...
        }
    }

    However, in every case where I've needed to do this, I have also needed to parse the HTML, so I've always used HTML::Parser for that too.
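
    For the <META> side of the task, here is a minimal sketch of what the HTML::Parser approach might look like. It assumes the version 3 API of HTML::Parser and reuses the sample HTML from the question; the variable names ($description, $parser) and the fallback message are only illustrative.

```perl
use strict;
use warnings;
use HTML::Parser;

# Sample input taken from the original question.
my $html = '<html><meta name="description" content="sf"></html>';

my $description;

# Start-tag handler: receives the lowercased tag name and a hash ref
# of the tag's attributes (attribute names are lowercased by default).
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            return unless $tagname eq 'meta';
            my $name = $attr->{name};
            return unless defined $name && lc($name) eq 'description';
            $description = $attr->{content};
        },
        'tagname, attr',
    ],
);

$parser->parse($html);
$parser->eof;    # signal end of document

print defined $description ? "$description\n" : "no description found\n";
```

    Because the parser hands over the attributes as a hash, the order of name and content inside the tag and the quoting style no longer matter.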

    Hope this all helps a bit. Feel free to ask further questions.

    Best regards

    -lem, but some call me fokat

Re: Repost of regex
by ihb (Deacon) on Feb 02, 2003 at 21:05 UTC

    You've managed to escape the dot after \1, so \.+? matches only literal dots instead of any character.

    There are a couple of other direct problems with your pattern, but I won't even go there, since this is a dead end anyway.
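
    Purely to illustrate the escaped-dot point, here is a sketch of the pattern from the question with the backslash before the dot removed. Two further changes are assumptions made so that it matches the sample input: the " +" after the second quote is dropped (the sample has no space after the opening quote of content), and the /s modifier is added so that the dot also matches newlines. It remains fragile, for instance it still requires name to come before content, so treat it as a demonstration and not as a replacement for a real parser.

```perl
use strict;
use warnings;

# Sample input taken from the original question.
my $html = '<html><meta name="description" content="sf"></html>';

my $description;

# \1 and \2 refer back to the opening quote of each attribute.
# The dot after \1 is no longer escaped, so .+? can skip over
# whatever separates the two attributes.
if ($html =~ /<meta.+?name\s*=\s*("|')description\1.+?content\s*=\s*("|')(.*?)\2/s) {
    $description = $3;
}

print defined $description ? "$description\n" : "no description found\n";
```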

    ihb