DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I just finished writing this and I was wondering if anyone had some suggestions? I was curious if I could create a useful link extractor *without* the use of a module. Here is the code:

#!/usr/bin/perl -w                   #Path to perl interpreter.

use strict;                          #The strict pragma.

my @link_array;                      #Declare an array named link_array.
@ARGV = "test1001.html";             #The file on the 'command line'.

while(<>)                            #Does the file still have content?
{
    s/<(?:[^>'"]*|(['"]).*?")*>//gs; #Remove all HTML tags.
    s/^(\s+)//g;                     #Remove all leading whitespace.

    #If a match is found, add it to the end of the array.
    #The search is global and case-insensitive.
    push @link_array, $_ if(/^http:/gi);
    push @link_array, $_ if(/^ftp:/gi);
    push @link_array, $_ if(/^mailto:/gi);
}                                    #End of the while loop.

open( FH, ">>links.txt" );           #Open the file links.txt for
                                     #appending.
print FH @link_array, "\n";          #Write the links we found to
                                     #the file.
close FH;                            #Close the file handle.

Replies are listed 'Best First'.
(jeffa) Re: Pretty cool link extractor.
by jeffa (Bishop) on Mar 26, 2002 at 00:51 UTC
    That didn't work for the following:
    <a href="http://foo.com">bar</a>
    <a href="index.html">index</a>
    
    Maybe I am missing something, but I think the regex you use to 'remove all HTML tags' isn't working the way you think it should. Here is how I would do it:
    use strict;
    use Data::Dumper;

    my @link;
    my @data = <DATA>;

    for (@data) {
        my ($url,$label) = $_ =~ /href\s*=\s*"([^"]+)"\s*>([^<]+)/;
        next unless $url and $label;
        push @link, [$url,$label];
    }
    print Dumper \@link;

    __DATA__
    <a href="http://foo.com">bar</a>
    <a href="index.html">index</a>
    But I would NEVER use that in any serious code (it has its limitations - only one link per line). I would use a module. Now, why people think that writing code to bypass using a module (one that has already been tested and used by many, many people around the world) is a 'good thing' eludes me. Is it because you don't have permission to install modules? Then please read A Guide to Installing Modules - there is no excuse.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Even that won't catch everything: a mixed-case 'href', single quotes instead of double (or none at all!), other attributes after the href (such as javascript events), or a label containing an unquoted '<'. You're not just trying to match valid HTML; you're trying to match HTML that is "out there". Your example also doesn't catch the case where there is no label at all - the link might wrap an image.

      For these reasons I wholeheartedly recommend using one of the HTML:: modules, e.g. HTML::LinkExtor or HTML::TreeBuilder. Just because I am feeling perverse, I've come up with a perverse regex that seems to work with my 'odd' cases (though it will fail if there is whitespace in the url):

      use strict;
      use warnings;
      use Data::Dump qw(dump);

      my @links = ();
      my $html = do { local $/; <DATA> };

      while ($html =~ /[Hh][Rr][Ee][Ff]\s*=\s*['"]?([^\s"'>]+)['"]?.*?>(.*?)<\s*\/\s*[Aa]\s*>/gs) {
          push @links, [$1, $2];
      }
      print dump(@links), "\n";

      __DATA__
      <a href="http://foo.com">bar</a>
      <a href="index.html">index</a>
      <a href='/blah'>some text</a>
      <a href="http://some.url.com" onClick="">blah</a>
      <a href=http://bad.bad.bad>text</a>
      <a href="encodeme.html">< back</a>
      <a href="image.gif"><img src="blah.jpg"></a>
      <a href="/multline/example.html"
      >this is some text
      </a>
      ick! :)
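
      For comparison, here is a minimal sketch of the tree-based approach mentioned above, using HTML::TreeBuilder (the file name page.html is just a placeholder):

      use strict;
      use warnings;
      use HTML::TreeBuilder;

      # build a parse tree from the file, then walk it for <a> tags
      my $tree = HTML::TreeBuilder->new_from_file('page.html');
      for my $anchor ($tree->look_down(_tag => 'a')) {
          my $href = $anchor->attr('href');
          print "$href\n" if defined $href;
      }
      $tree->delete;    # HTML::Element trees must be freed explicitly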

      Update: belg4mit suggested URI::Find, which sounds like a sensible idea.
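
      A minimal URI::Find sketch (it works on plain text rather than parsed HTML, so here the page is simply slurped whole):

      use strict;
      use warnings;
      use URI::Find;

      my $text = do { local $/; <> };   # slurp the page from a file or STDIN

      my @found;
      my $finder = URI::Find->new(sub {
          my ($uri, $orig_text) = @_;
          push @found, $uri;            # collect each URI found
          return $orig_text;            # put the original text back unchanged
      });
      $finder->find(\$text);            # find() takes a reference to the text
      print "$_\n" for @found;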

      gav^

Re: Pretty cool link extractor.
by shotgunefx (Parson) on Mar 26, 2002 at 00:54 UTC
    A couple of points. There are many one could make; I'll address a few. The regex that removes tags will fail on certain inputs - parsers are made for a reason.

    Most links are inside the tags you are throwing away.

    What if the line contains "http://yahoo.com stinks"?
    "http://yahoo.com stinks" is not a URL, but your match would keep the whole line.

    You could combine all three push statements into one:
    push @link_array, $_ if (/^(http|ftp|mailto):/i);
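
    Better still, capture just the URL itself rather than the whole line (a sketch - the \S+ here is an assumption about where a URL ends):

    push @link_array, $1 while m{\b((?:http|ftp|mailto):\S+)}gi;

    That way a line like "http://yahoo.com stinks" stores only "http://yahoo.com".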

    I personally don't see anything wrong with trying to reinvent wheels; you can learn a lot. But you should study the wheel and see what it does and what you can do better.

    -Lee

    "To be civilized is to deny one's nature."

      >I personally don't see anything wrong with trying to
      >reinvent wheels; you can learn a lot. But you should study
      >the wheel and see what it does and what you can do better.

      Well said!
      In that spirit, I offer a different cool link extractor:
      perl -MHTML::LinkExtor -e 'print qq{@$_\n} foreach HTML::LinkExtor->new->parse_file($ARGV[0])->links'
      What's cool is not that it is a one-liner, but that it is usable as a fast "tool" in my editor. While viewing a page in my web browser (Opera), I hit a command key to view source (in UltraEdit), another command key to extract links, and I have all the links from that page in an unnamed buffer. I use this every day.
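
      Written out as a script, the one-liner does roughly this:

      use HTML::LinkExtor;

      # parse the file named on the command line; links() returns one
      # array ref per link, as [tag, attribute => value, ...]
      my $parser = HTML::LinkExtor->new;
      $parser->parse_file($ARGV[0]);
      print "@$_\n" for $parser->links;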

      Bruce Gray

Re: Pretty cool link extractor.
by jeffenstein (Hermit) on Mar 26, 2002 at 07:03 UTC

    Others have commented on your code. I just wanted to mention something about the comments in your code.

    It's much better to write a block comment before a section describing what that section does, and to leave out the single-line comments that are obvious from the code. For instance, "#Path to perl interpreter" and "#Open the file links.txt for appending" are obvious from the code and should be eliminated. They only clutter up the code itself and make the flow of the program more difficult to follow.
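
    For example, the main loop of the original script might read like this with a single block comment (keeping the original substitutions as they are, and using the combined match suggested earlier in the thread):

    # Strip HTML tags and leading whitespace, then collect any
    # lines that begin with an http:, ftp:, or mailto: URL.
    while (<>) {
        s/<(?:[^>'"]*|(['"]).*?")*>//gs;
        s/^\s+//;
        push @link_array, $_ if /^(?:http|ftp|mailto):/i;
    }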

    The Practice of Programming by Kernighan & Pike has a chapter that goes over basic coding style, and is an excellent guide to good commenting.

      To all.
      Thanks for the feedback. Since I am still learning Perl, my well-intentioned ideas will sometimes be a 'bit off'. I suppose that is part of the learning process. As for the commenting, my style developed that habit when I took a 'C' course last semester. The instructor would lower your grade *by a full letter* if you didn't comment every line. Ugh... I'll make the necessary corrections in future code samples.

      DigitalKitty

      -> Meow <-