Repost of regex

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Repost of regex by Abigail-II (Bishop) on Feb 02, 2003 at 23:35 UTC
Is there a reason why you don't want to use `HTML::TokeParser`, or another CPAN module that does HTML parsing? I mean, it can't be because you can do better (otherwise, you wouldn't have failed). Abigail
Re: Repost of regex by fokat (Deacon) on Feb 03, 2003 at 03:14 UTC
Generally, it is unwise (at best) not to use a CPAN module when one exists and does what you want. That being said...

If you're trying to write a web spider, why don't you save yourself some time and `use LWP::RobotUA` along with `use HTML::TokeParser` or `use HTML::Parser` (my personal favorite), as other monks suggested? Writing this task (extracting links to follow and analyzing the `<META>` tags) with `HTML::Parser` is a matter of a few lines. `LWP` already allows you to get the HTML.

According to recipe 20.3 in The Perl Cookbook, you could also use `HTML::LinkExtor` to extract the links, as this code shows (copied verbatim):

    use HTML::LinkExtor;

    $parser = HTML::LinkExtor->new(undef, $base_url);
    $parser->parse_file($filename);
    @links = $parser->links;
    foreach $linkarray (@links) {
        my @element  = @$linkarray;
        my $elt_type = shift @element;    # element type

        # possibly test whether this is an element we're interested in
        while (@element) {
            # extract the next attribute and its value
            my ($attr_name, $attr_value) = splice(@element, 0, 2);
            # ... do something with them ...
        }
    }

However, whenever I've needed to do this, I have also needed to parse the HTML, so I've always used `HTML::Parser` for that too.

Hope this all helps a bit. Feel free to ask further questions.

Best regards

-lem, but some call me fokat
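For what it's worth, here is a minimal sketch of the "few lines" mentioned above, using `HTML::TokeParser` (one of the modules named in this thread). The sample HTML and the variable names are made up for illustration. A real spider would feed in whatever `LWP` fetched and would probably resolve relative links against a base URL.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample page; in a spider this would be the fetched content.
my $html = <<'HTML';
<html><head>
<meta name="robots" content="noindex,follow">
<META NAME="description" CONTENT="Sample page">
</head><body>
<a href="/about.html">About</a>
<a name="top">no href here</a>
<A HREF="http://example.com/">Example</A>
</body></html>
HTML

# Passing a scalar reference makes the parser read from the string.
my $parser = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my (@links, %meta);

# get_tag() skips ahead to the next start tag of one of the given types.
# Tag and attribute names are reported in lower case.
while (my $tag = $parser->get_tag('a', 'meta')) {
    my ($name, $attr) = @$tag;
    if ($name eq 'a') {
        # Anchors without href (named anchors) are not links to follow.
        push @links, $attr->{href} if defined $attr->{href};
    }
    elsif (defined $attr->{name}) {
        # Store <meta name=... content=...> pairs keyed by lower-cased name.
        $meta{ lc $attr->{name} } = $attr->{content};
    }
}

print "link: $_\n" for @links;
print "meta $_: $meta{$_}\n" for sort keys %meta;
```

The parser copes with mixed-case tags and attributes, which is exactly the sort of detail a hand-rolled regex tends to miss.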
Re: Repost of regex by ihb (Deacon) on Feb 02, 2003 at 21:05 UTC
You've managed to escape the dot after `\1`. There are a couple of other direct problems with your pattern, but I won't even go there, since this is a dead end anyway. `ihb`
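To illustrate the escaped-dot remark only (not to recommend parsing HTML this way), here is a small sketch. The original pattern is not reproduced in this thread, so the input string and both patterns below are hypothetical stand-ins. They differ solely in whether the dot after `\1` is escaped.

```perl
use strict;
use warnings;

# Hypothetical input and patterns; the poster's actual regex is not shown here.
my $html = q{<a href="x.html" class="y">link</a>};

# Escaped dot: \.*? can only match literal '.' characters, so the lazy
# capture has to keep growing until a quote is directly followed by '>'.
my $escaped = $html =~ /<a\s+href=(["'])(.*?)\1\.*?>/ ? $2 : undef;

# Unescaped dot: .*? matches any characters up to '>', so the capture
# can stop at the first closing quote.
my $unescaped = $html =~ /<a\s+href=(["'])(.*?)\1.*?>/ ? $2 : undef;

print "escaped dot:   $escaped\n";
print "unescaped dot: $unescaped\n";
```

With this input the escaped version captures `x.html" class="y`, while the unescaped one captures just `x.html`. Even the "fixed" pattern still breaks on unquoted attributes or a different attribute order, which is why the other replies point to a real parser.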