how do i split a link

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

What is the best way to split a link in HTML format so that I can grab the domain name?

The link could be in several formats. I won't know until run time. Here are a couple of possibilities...

<a href="http://www.foo.com">description</a>
<a href='http://www.foo.com'>image here</a>

I need to parse each link and return foo.com

I really stink with regexes :-(

Thanks Monks!!


Re: how do i split a link
by gav^ (Curate) on Apr 13, 2002 at 17:09 UTC

Use HTML::LinkExtor and URI:

use HTML::LinkExtor;
use URI;

my @links = ();
my $html = do { local $/; <DATA> };

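sub extract_links {
    my ($tag, %attr) = @_;
    # 'return', not 'next': this is a callback, not a loop body
    return unless $tag eq 'a';
    # keep the last two labels of the host, e.g. www.foo.com -> foo.com
    my @parts = split /\./, URI->new($attr{href})->host;
    my $host = join '.', @parts[-2, -1];
    push @links, $host;
}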

my $p = HTML::LinkExtor->new(\&extract_links);
$p->parse($html);

print join "\n", @links;

__DATA__
<a href="http://www.foo.com">description</a>
<a href='http://www.foo.com'>image here</a>

gav^
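
The callback above keeps only the last two dot-separated labels of the host. A minimal sketch of just that step, using core Perl only, shows what it returns for the hosts that appear in this thread. The helper name last_two_labels is hypothetical and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the last two dot-separated labels of a host name.
sub last_two_labels {
    my ($host) = @_;
    my @parts = split /\./, $host;
    return $host if @parts <= 2;          # nothing to trim
    return join '.', @parts[-2, -1];
}

print last_two_labels('www.foo.com'), "\n";               # foo.com
print last_two_labels('foo-bar-publishers.co.uk'), "\n";  # co.uk
```

The second call shows the limit of the approach: for hosts under a two-part suffix such as co.uk, the last two labels are the suffix rather than the registered domain, so this only suits inputs like www.foo.com.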


(crazyinsomniac) Re^2: how do i split a link

by crazyinsomniac (Prior) on Apr 14, 2002 at 10:16 UTC

use HTML::LinkExtor;
my @links = ();
my $html = join '', <DATA>; # much more elegant than => do { local $/; <DATA> };

sub extract_links {
    my ($tag,undef,$url) = @_;

    if($tag eq 'a') {
        push @links, $url->host;
    }
}

my $p = HTML::LinkExtor->new(\&extract_links,'http://foobar.com');
$p->parse($html);

print join "\n", @links;

__DATA__
<a href="http://www.foo.com">description</a>
<a href='http://www.foo.com'>image here</a>
<A href='http://foo-bar-publishers.co.uk'>image here</a>

Update: no need for a patch; it's already in there (at least in $VERSION = sprintf("%d.%02d", q$Revision: 1.31 $ =~ /(\d+)\.(\d+)/);).

______crazyinsomniac_____________________________
Of all the things I've lost, I miss my mind the most.
perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"


Re: (crazyinsomniac) Re^2: how do i split a link

by gav^ (Curate) on Apr 14, 2002 at 19:44 UTC

gav^


Re: how do i split a link
by cjf (Parson) on Apr 13, 2002 at 16:55 UTC

It's best to use a module for HTML parsing; it's more complicated than it looks :). HTML::TokeParser should do nicely, and there's even a tutorial on it here.
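
To illustrate why hand-rolled matching is more complicated than it looks, here is a rough sketch of a naive regex tried against the two link styles from the question. The third sample (uppercase attribute, spaces around the equals sign) and the helper name naive_href are assumptions added for illustration.

```perl
use strict;
use warnings;

my @samples = (
    q{<a href="http://www.foo.com">description</a>},
    q{<a href='http://www.foo.com'>image here</a>},
    q{<A HREF = "http://www.foo.com">description</A>},   # assumed variant
);

# Naive pattern: lowercase href, no spaces, double quotes only.
sub naive_href {
    my ($html) = @_;
    return $html =~ /href="([^"]+)"/ ? $1 : undef;
}

for my $s (@samples) {
    my $url = naive_href($s);
    print defined $url ? "matched: $url\n" : "missed:  $s\n";
}
```

Only the first sample matches. Each variant can be patched into the pattern, but a parser module already handles quoting, case, and whitespace.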


Re: how do i split a link
by Amoe (Friar) on Apr 13, 2002 at 18:08 UTC

use HTML::TokeParser;

# Return one hash per <a> tag: url holds the href, txt the link text.
sub gimmelinks {
    my $html = shift;
    my @linkywinkys;
    my $parsee = HTML::TokeParser->new(\$html);
    while (my $tag = $parsee->get_tag('a')) {
        # $tag->[1] is the hash of the tag's attributes
        push @linkywinkys, {url => $tag->[1]->{href},
                            txt => $parsee->get_trimmed_text('/a')};
    }
    return @linkywinkys;
}

That should do you fine if you want to use good old HTML::TokeParser. You get the link text free as well! It returns a list of hash references, one per link, where the url is under the "url" key and the text is under the "txt" key. What am I saying - figure it out yourself :)
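
To make the return shape concrete, here is a small sketch of walking such a list. The sample data is hand-written to stand in for what gimmelinks would return for the two links in the question, and the helper name format_links is hypothetical.

```perl
use strict;
use warnings;

# Hand-written stand-in for the list gimmelinks() returns.
my @links = (
    { url => 'http://www.foo.com', txt => 'description' },
    { url => 'http://www.foo.com', txt => 'image here' },
);

# Hypothetical helper: turn each hash reference into a "text => url" line.
sub format_links {
    return map { "$_->{txt} => $_->{url}" } @_;
}

print "$_\n" for format_links(@links);
```

From here, each url value can be handed to URI, as in the replies above, to pull out the host.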
