Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

What is the best way to split a link in HTML format so that I can grab the domain name?

The link could be in several formats. I won't know until run time. Here are a couple of possibilities...

<a href="http://www.foo.com">description</a>
<a href='http://www.foo.com'>image here</a>

I need to parse each link and return foo.com.

I really stink with regexes :-(

Thanks Monks!!

Re: how do i split a link
by gav^ (Curate) on Apr 13, 2002 at 17:09 UTC
    You could combine HTML::LinkExtor and URI:
    use strict;
    use warnings;
    use HTML::LinkExtor;
    use URI;

    my @links = ();
    my $html = do { local $/; <DATA> };   # slurp everything after __DATA__

    sub extract_links {
        my ($tag, %attr) = @_;
        # skip everything that is not an <a href="...">
        return unless $tag eq 'a' && defined $attr{href};
        my @parts = split /\./, URI->new($attr{href})->host;
        my $host = join '.', @parts[-2, -1];   # keep the last two labels
        push @links, $host;
    }

    my $p = HTML::LinkExtor->new(\&extract_links);
    $p->parse($html);
    $p->eof;
    print join "\n", @links;

    __DATA__
    <a href="http://www.foo.com">description</a>
    <a href='http://www.foo.com'>image here</a>
    Of course you might want to add some error checking...
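
One possible shape for that error checking, as a rough sketch rather than gav^'s own code: the helper name host_of is made up for illustration. It relies on the fact that URI objects for relative links or mailto: addresses have no host method, so it asks with can() before calling it.

```perl
use strict;
use warnings;
use URI;

# host_of() is a hypothetical helper, not part of URI or HTML::LinkExtor.
# It returns the host of an href, or undef when there is none to return.
sub host_of {
    my ($href) = @_;
    return unless defined $href && length $href;
    my $uri = URI->new($href);
    # Relative links, mailto: and similar URIs have no host method.
    return unless $uri->can('host');
    return $uri->host;
}

for my $href ('http://www.foo.com/', 'mailto:me@foo.com', '/relative/path') {
    my $host = host_of($href);
    print defined $host ? "$href -> $host\n" : "$href -> (no host)\n";
}
```

Inside the callback you would then skip any link for which host_of returns undef instead of letting the method call die.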

    gav^

      You might find reading the module, as well as its documentation, saves typing ;)
      use HTML::LinkExtor;

      my @links = ();
      my $html = join'',<DATA>; # much more elegant than => do { local $/; <DATA> };

      sub extract_links {
          my ($tag,undef,$url) = @_;
          if($tag eq 'a') {
              push @links, $url->host;
          }
      }

      my $p = HTML::LinkExtor->new(\&extract_links,'http://foobar.com');
      $p->parse($html);
      print join "\n", @links;

      __DATA__
      <a href="http://www.foo.com">description</a>
      <a href='http://www.foo.com'>image here</a>
      <A href='http://foo-bar-publishers.co.uk'>image here</a>
      Also, this "foo.com" request is rather silly, considering all the weirdo naming conventions out there (city.county.state.us ...)
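
To make the naming-convention problem concrete, here is a minimal sketch of trimming a host name down to "suffix plus one label". The function name registrable_domain and the four-entry suffix table are assumptions made up for this example. A real program would need a complete list of public suffixes, and even this version gets city.county.state.us wrong.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny table of two-label suffixes.
# It is illustrative only and nowhere near complete.
my %multi_label_suffix = map { $_ => 1 } qw(co.uk org.uk com.au co.jp);

# Reduce a host name to its suffix plus one label,
# e.g. www.foo.com -> foo.com, www.example.co.uk -> example.co.uk.
sub registrable_domain {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    return $host if @labels < 2;    # e.g. "localhost": nothing to trim

    # Keep three labels when the last two form a known suffix, else two.
    my $keep = $multi_label_suffix{"$labels[-2].$labels[-1]"} ? 3 : 2;
    $keep = @labels if $keep > @labels;
    return join '.', @labels[ -$keep .. -1 ];
}

print registrable_domain($_), "\n"
    for qw(www.foo.com foo-bar-publishers.co.uk www.example.co.uk);
```

Blindly taking the last two labels would turn foo-bar-publishers.co.uk into co.uk, which is why the table lookup comes first.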

      update: no need for a patch, it's in there (at least in $VERSION = sprintf("%d.%02d", q$Revision: 1.31 $ =~ /(\d+)\.(\d+)/);).

       
      ______crazyinsomniac_____________________________
      Of all the things I've lost, I miss my mind the most.
      perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

        Thanks, I never knew about that 3rd parameter. Perhaps you can suggest a patch to the documentation?

        gav^

Re: how do i split a link
by cjf (Parson) on Apr 13, 2002 at 16:55 UTC

    It's best to use a module for HTML parsing; it's more complicated than it looks :). HTML::TokeParser should do nicely, and there's even a tutorial on it here.

Re: how do i split a link
by Amoe (Friar) on Apr 13, 2002 at 18:08 UTC
    sub gimmelinks {
        my $html = shift;
        my @linkywinkys;
        use HTML::TokeParser;
        my $parsee = HTML::TokeParser->new(\$html);
        while (my $tag = $parsee->get_tag('a')) {
            push @linkywinkys, {
                url => $tag->[1]->{href},
                txt => $parsee->get_trimmed_text('/a'),
            };
        }
        return @linkywinkys;
    }

    That should do you fine if you want to use good old HTML::TokeParser. You get the link text free as well! It returns a list of hash references, where the URL is under the "url" key and the link text is under the "txt" key. What am I saying - figure it out yourself :)
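
For anyone who would rather not figure it out alone, a small sketch of walking that return value. The @links data below is hard-coded as an assumption of what gimmelinks would hand back for the two links in the question, and describe_link is a made-up helper name.

```perl
use strict;
use warnings;

# Assumed sample of the list of hash references that gimmelinks()
# would return for the HTML in the question (hard-coded here).
my @links = (
    { url => 'http://www.foo.com', txt => 'description' },
    { url => 'http://www.foo.com', txt => 'image here' },
);

# describe_link() is a hypothetical helper: one line of text per link.
sub describe_link {
    my ($link) = @_;
    return "$link->{txt}: $link->{url}";
}

print describe_link($_), "\n" for @links;
```

Each element is a reference, so the fields are reached with the arrow syntax ($link->{url}) rather than $link{url}.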

    --
    amoe