DigitalKitty has asked for the wisdom of the Perl Monks concerning the following question:

Hi all.

I just finished writing this and I was wondering if anyone had some suggestions? I was curious if I could create a useful link extractor *without* the use of a module. Here is the code:

#!/usr/bin/perl -w                   #Path to perl interpreter.

use strict;                          #The strict pragma.

my @link_array;                      #Declare an array named link_array.
@ARGV = "test1001.html";             #The file on the 'command line'.

while(<>)                            #Does the file still have content?
{
    s/<(?:[^>'"]*|(['"]).*?")*>//gs; #Remove all HTML tags.
    s/^(\s+)//g;                     #Remove all leading whitespace.

    #If a match is found, add it to the end of the array.
    #The search is global and case-insensitive.
    push @link_array, $_ if(/^http:/gi);
    push @link_array, $_ if(/^ftp:/gi);
    push @link_array, $_ if(/^mailto:/gi);
}                                    #End of the while loop.

open( FH, ">>links.txt" );           #Open the file links.txt for
                                     #appending.
print FH @link_array, "\n";          #Write the links we found to
                                     #the file.
close FH;                            #Close the file handle.

Replies are listed 'Best First'.
(jeffa) Re: Pretty cool link extractor.
by jeffa (Bishop) on Mar 26, 2002 at 00:51 UTC
    That didn't work for the following:
    <a href="http://foo.com">bar</a>
    <a href="index.html">index</a>
    
    Maybe I am missing something, but I think the regex you use to 'remove all HTML tags' isn't working the way you think it should. Here is how I would do it:
    use strict;
    use Data::Dumper;

    my @link;
    my @data = <DATA>;

    for (@data) {
        my ($url,$label) = $_ =~ /href\s*=\s*"([^"]+)"\s*>([^<]+)/;
        next unless $url and $label;
        push @link, [$url,$label];
    }
    print Dumper \@link;

    __DATA__
    <a href="http://foo.com">bar</a>
    <a href="index.html">index</a>
    But I would NEVER use that in any serious code (it has its limitations - only one link per line). I would use a module. Now, why people think that writing code to bypass using a module (one that has already been tested and used by many, many people around the world) is a 'good thing' eludes me. Is it because you don't have permission to install modules? Then please read A Guide to Installing Modules - there is no excuse.

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Even that won't catch everything: a mixed-case 'href', single quotes instead of double (or none at all!), other attributes after the href (such as javascript events), or a label containing an unquoted '<'. You're not just trying to match valid HTML; you're trying to match HTML that is "out there". Your example also doesn't catch the case where there is no label at all - the link might wrap an image.

      For these reasons I wholeheartedly recommend using one of the HTML:: modules, e.g. HTML::LinkExtor or HTML::TreeBuilder. Just because I am feeling perverse, I've come up with a perverse regex that seems to work with my 'odd' cases (though it will fail if there is whitespace in the url):

      use strict;
      use warnings;
      use Data::Dump qw(dump);

      my @links = ();
      my $html = do { local $/; <DATA> };

      while ($html =~ /[Hh][Rr][Ee][Ff]\s*=\s*['"]?([^\s"'>]+)['"]?.*?>(.*?)<\s*\/\s*[Aa]\s*>/gs) {
          push @links, [$1, $2];
      }
      print dump(@links), "\n";

      __DATA__
      <a href="http://foo.com">bar</a>
      <a href="index.html">index</a>
      <a href='/blah'>some text</a>
      <a href="http://some.url.com" onClick="">blah</a>
      <a href=http://bad.bad.bad>text</a>
      <a href="encodeme.html">< back</a>
      <a href="image.gif"><img src="blah.jpg"></a>
      <a href="/multline/example.html"
      >this is some text
      </a>
      ick! :)
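
      For comparison, here is a minimal sketch of the tree-based approach mentioned above, using HTML::TreeBuilder (the file name page.html is just a placeholder):

      use strict;
      use warnings;
      use HTML::TreeBuilder;

      # build a parse tree from the file, then walk it for <a> tags
      my $tree = HTML::TreeBuilder->new_from_file('page.html');
      for my $anchor ($tree->look_down(_tag => 'a')) {
          my $href = $anchor->attr('href');
          print "$href\n" if defined $href;
      }
      $tree->delete;    # HTML::Element trees must be freed explicitly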

      Update: belg4mit suggested URI::Find, which sounds like a sensible idea.
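
      A minimal URI::Find sketch (it works on plain text rather than parsed HTML, so here the page is simply slurped whole):

      use strict;
      use warnings;
      use URI::Find;

      my $text = do { local $/; <> };   # slurp the page from a file or STDIN

      my @found;
      my $finder = URI::Find->new(sub {
          my ($uri, $orig_text) = @_;
          push @found, $uri;            # collect each URI found
          return $orig_text;            # put the original text back unchanged
      });
      $finder->find(\$text);            # find() takes a reference to the text
      print "$_\n" for @found;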

      gav^

Re: Pretty cool link extractor.
by shotgunefx (Parson) on Mar 26, 2002 at 00:54 UTC
    A couple of points. There are many one could make; I'll address a few. The regex that removes tags will fail on certain inputs - parsers are made for a reason.

    Most links are inside the tags you are throwing away.

    What if the line contains "http://yahoo.com stinks"?
    "http://yahoo.com stinks" is not a URL, but your match would keep the whole line.

    You could combine all three push statements into one:
    push @link_array, $_ if (/^(http|ftp|mailto):/i);
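
    Better still, capture just the URL itself rather than the whole line (a sketch - the \S+ here is an assumption about where a URL ends):

    push @link_array, $1 while m{\b((?:http|ftp|mailto):\S+)}gi;

    That way a line like "http://yahoo.com stinks" stores only "http://yahoo.com".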

    I personally don't see anything wrong with trying to reinvent wheels; you can learn a lot. But you should study the wheel and see what it does and what you can do better.

    -Lee

    "To be civilized is to deny one's nature."

      >I personally don't see anything wrong with trying to
      >reinvent wheels; you can learn a lot. But you should study
      >the wheel and see what it does and what you can do better.

      Well said!
      In that spirit, I offer a different cool link extractor:
      perl -MHTML::LinkExtor -e 'print qq{@$_\n} foreach HTML::LinkExtor->new->parse_file($ARGV[0])->links'
      What's cool is not that it is a one-liner, but that it is usable as a fast "tool" in my editor. While viewing a page in my web browser (Opera), I hit a command key to view source (in UltraEdit), another command key to extract links, and I have all the links from that page in an unnamed buffer. I use this every day.
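
      Written out as a script, the one-liner does roughly this:

      use HTML::LinkExtor;

      # parse the file named on the command line; links() returns one
      # array ref per link, as [tag, attribute => value, ...]
      my $parser = HTML::LinkExtor->new;
      $parser->parse_file($ARGV[0]);
      print "@$_\n" for $parser->links;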

      Bruce Gray

Re: Pretty cool link extractor.
by jeffenstein (Hermit) on Mar 26, 2002 at 07:03 UTC

    Others have commented on your code. I just wanted to mention something about the comments in your code.

    It's much better to write a block comment before a section describing what that section does, and to leave out the single-line comments that are obvious from the code. For instance, "#Path to perl interpreter" and "#Open the file links.txt for appending" are obvious from the code and should be eliminated. They only clutter up the code itself and make the flow of the program more difficult to follow.
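
    For example, the main loop of the original script might read like this with a single block comment (keeping the original substitutions as they are, and using the combined match suggested earlier in the thread):

    # Strip HTML tags and leading whitespace, then collect any
    # lines that begin with an http:, ftp:, or mailto: URL.
    while (<>) {
        s/<(?:[^>'"]*|(['"]).*?")*>//gs;
        s/^\s+//;
        push @link_array, $_ if /^(?:http|ftp|mailto):/i;
    }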

    The Practice of Programming by Kernighan & Pike has a chapter that goes over basic coding style, and is an excellent guide to good commenting.

      To all.
      Thanks for the feedback. Since I am still learning Perl, my well-intentioned ideas will sometimes be a 'bit off'. I suppose that is part of the learning process. As for the commenting, my style developed that habit when I took a 'C' course last semester. The instructor would lower your grade *by a full letter* if you didn't comment every line. Ugh... I'll make the necessary corrections in future code samples.

      DigitalKitty

      -> Meow <-