in reply to Re: How can I find the links in HTML tags?
in thread How can I find the links in HTML tags?
It missed the two valid href links (the first and the last) and wrongly flagged the second one as an href link when run against this test document:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD><TITLE>Test</TITLE></HEAD>
<BODY>
<a href ="foobaz"></a>
<a name="href=foo"></a>
<a name="foo>bar" href="foobar"></a>
</BODY>
</HTML>
I think it's very important to realize that while doing this sort of thing seems easy, it isn't. There are a lot of cases that you will miss if you try. Use HTML::LinkExtor. Use HTML::Parser. Use HTML::TokeParser. Heck, use URI::Find. Just don't "do it yourself" unless you're prepared to devote quite a bit of time to developing, honing, and fixing your solution.
I think that Dermot knows all of this, judging from his comment that using the CPAN is probably the best way to go, but I wanted to make sure no one decided to use this instead because it was "easier". Do not.
I leave you with something I posted to Usenet not too long ago -- a script that correctly finds all anchor links in a document -- in 4 lines of Perl. It's that easy to do with HTML::Parser or one of the other tools made for such things.
-dlc
#!/usr/bin/perl -wl
use strict;use HTML::Parser;my $p=HTML::Parser->new(api_version =>3);
$p->handler(start=>sub{print shift->{href}if shift eq 'a'}, 'tagname,attr');
local $/;$p->parse(<>);#Just another URI finder
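
For anyone who would rather read it un-golfed, here is a rough sketch of the same idea using HTML::TokeParser, one of the other modules mentioned above. It is not the script I posted to Usenet, just an illustration: extract_hrefs is a helper name I made up for this post, and the test document is embedded as a heredoc purely for convenience. Unlike the short version, it skips anchors that have no href attribute.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# The test document from earlier in this post, embedded for convenience.
my $html = <<'END_HTML';
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<HTML>
<HEAD><TITLE>Test</TITLE></HEAD>
<BODY>
<a href ="foobaz"></a>
<a name="href=foo"></a>
<a name="foo>bar" href="foobar"></a>
</BODY>
</HTML>
END_HTML

# extract_hrefs is a hypothetical helper name, not part of HTML::TokeParser.
# It returns the href value of every <a> start tag that has one.
sub extract_hrefs {
    my ($doc) = @_;

    # Passing a scalar reference tells the parser to read from the string.
    my $p = HTML::TokeParser->new(\$doc)
        or die "Cannot create parser";

    my @hrefs;

    # get_tag('a') skips ahead to the next <a ...> start tag and returns
    # an array ref [ $tag, \%attr, \@attrseq, $text ], or undef at the end.
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};
        push @hrefs, $href if defined $href;    # skip name-only anchors
    }
    return @hrefs;
}

print "$_\n" for extract_hrefs($html);
```

Against the test document this should report foobaz and foobar and ignore the name="href=foo" anchor, because the parser deals with the quoting and the stray whitespace before the equals sign, so the code never has to.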