Scott_J has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am currently having this problem. I have a section of HTML that looks like the following....

Go to the BBC's website <a href=http://www.bbc.co.uk">BBC</a> or you can visit the inland revenues pages at <a href="http://www.inlandrevenue.com">Inland Revenue</a> which will give you the information you need

From this code I need to extract out the data from the anchor tags so that the 'http://www.bbc.co.uk' part is in one variable, 'www.bbc.co.uk' (I understand that I can just strip the first part to get this) is in another and the 'BBC' part is also in a variable. I need to try this for each web link/anchor tag that I come across.

I have looked at the HTML::TokeParser module but I'm very new to Perl and it doesn't make much sense to me.

Thank you

Replies are listed 'Best First'.
Re: Extracting href's
by valdez (Monsignor) on Jan 10, 2003 at 13:04 UTC

    Welcome to the Monastery!

    HTML::TokeParser scares you? No problem: Ovid wrote HTML::TokeParser::Simple, which will make your life easier :)

    Here is a little example:

    use HTML::TokeParser::Simple;
    use strict;
    use vars qw( $parser $attributes );

    # parse the HTML file named on the command line
    $parser = HTML::TokeParser::Simple->new($ARGV[0]);

    while ( my $token = $parser->get_token ) {
        # skip everything except opening <a> tags
        next unless ($token->is_start_tag('a'));
        $attributes = $token->return_attr;
        if (exists $attributes->{href}) {
            print $attributes->{href}, "\n";
        }
    }

    We also have some nice Tutorials to get you started and many books reviewed at Book Reviews.

    Ciao, Valerio

      Thanks everyone. I'll have a try at this, this afternoon. :)
Re: Extracting href's
by andye (Curate) on Jan 10, 2003 at 12:26 UTC
    Hi Scott_J, and welcome to the monastery.

    You might be interested in How do I parse links out of a web page, and no doubt a super search will turn up some more material.

    update... removed the regexp I put here as it was incorrect... back in a sec with a better one... andy.

    continued...

    Basically you'd probably be better off parsing the HTML, but if you want to avoid doing that then you could use a regexp like this:

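    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $html = get('http://www.bbc.co.uk/') or die "Couldn't fetch the page\n";

    # $1 = scheme and host (empty for relative links), $2 = the rest of the URL.
    # $2 uses * rather than + so that a URL with no path is not split mid-hostname.
    while ($html =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]*)"|gi) {
        print "$1, $2 \n";
    }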
    which seems to work OK.

    However, there are a couple of drawbacks:
    - it's quite hard to read,
    - it'll fail in certain situations, e.g. if the page contains quoted html as part of the actual page text, and probably in lots of other situations I haven't thought of.
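
    One such situation shows up in the sample from the question itself, where the first href is missing its opening quote. The sketch below is only an illustration: it uses a simplified pattern (not the exact one above) that looks for double-quoted href values, and the sub name quoted_hrefs is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The sample from the question, including its missing opening quote.
my $html = q{Go to the BBC's website <a href=http://www.bbc.co.uk">BBC</a> }
         . q{or you can visit the inland revenues pages at }
         . q{<a href="http://www.inlandrevenue.com">Inland Revenue</a>};

# Simplified pattern (an assumption for illustration): it only matches
# href values wrapped in a pair of double quotes.
sub quoted_hrefs {
    my ($text) = @_;
    my @found;
    while ( $text =~ m|href\s*=\s*"([^"]+)"|gi ) {
        push @found, $1;
    }
    return @found;
}

my @found = quoted_hrefs($html);
print scalar(@found), " link(s) found: @found\n";
```

    Only the Inland Revenue link is reported; the BBC link is silently skipped because of the missing quote, which is the kind of bad HTML a real parser is more likely to cope with.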

    But... if this is just a quick and dirty hack, and reliability isn't a big issue, then the above regexp may do what you need.

    All the best,
    Andy.

    update again... re-read the question and found you also wanted the link text... hold on a mo...

    continued... Try this:

    # $1 = scheme and host, $2 = rest of the URL, $3 = link text
    while ($html =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]*)"\s*>(.*?)</A>|gi) {
        print "$1, $2, $3 \n";
    }

      No offense to andye (still good advice), but parsing HTML with a regexp is a bad idea.

      Stick with a tried-and-true module. It is far less likely to break on you (usually due to bad HTML, not code) and allows for further learning.

      Take for example HTML::TokeParser:

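      use LWP::Simple;
      use HTML::TokeParser;

      # $url is assumed to hold the address of the page to fetch
      my $content = get($url);
      my $ref = \$content;
      my $p = HTML::TokeParser->new($ref);
      my $token;

      # get_tag("a") skips ahead to the next <a> start tag
      while ($token = $p->get_tag("a")) {
          my $href = $token->[1]{href};             # attribute hash is element 1
          my $text = $p->get_trimmed_text("/a");    # text up to the closing </a>
          print "$href => $text\n";
      }
      ## Should work...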

      This looks intimidating, and it is. :) However, by learning how to use modules like TokeParser you'll not only get a better handle on what you want to do, but you'll be learning more about Perl, as well.

      Also, if you plan on doing this often, I suggest picking up a copy of Perl & LWP. It's a good resource for interacting with websites.

      John J Reiser
      newrisedesigns.com

      Well Scott, I hadn't read the "how to get links from web pages" node before, but I have a solution. This code is only helpful if you are trying to get just that one link. Here is my code:

      #!/usr/bin/perl -w
      use strict;

      my @getdata;
      while (<DATA>) {
          chomp;
          if ( /href=http/ ) {
              @getdata = split /\w+\s+|<|>|"|href|=/;
          }
      }
      print "$getdata[9]";

      __DATA__
      Go to the BBC's website <a href=http://www.bbc.co.uk">BBC</a> or you can visit the inland revenues pages at <a href="http://www.inlandrevenue.com">Inland Revenue</a> which will give you the information you need

      Hope this helps you work out the rest by yourself.
Re: Extracting href's
by vek (Prior) on Jan 10, 2003 at 15:20 UTC
    You've been given some good examples already but I thought I'd add one more. Try HTML::LinkExtractor; it should fit nicely with what you're attempting.

    -- vek --
Re: Extracting href's
by jdporter (Paladin) on Jan 10, 2003 at 22:08 UTC
    You'll hear a lot of people around here pushing HTML::TokeParser::Simple.
    Now, I'm not knocking it, but it seems to me that a lot of tasks that people want to do are actually more simply solved using HTML::TreeBuilder.
    use HTML::TreeBuilder;

    # given that the HTML to be parsed is in a variable named $html . . .

    # this line extracts the links from the html:
    my $links = HTML::TreeBuilder->new_from_content( $html )->extract_links;

    # and this iterates over the list of link data, slightly massaging each one:
    for ( map { [ $_->[0], ( $_->[0] =~ m#://(.*)# ), $_->[1]->as_text ] } @$links ) {
        # here are the three variables you wanted:
        my( $url, $address, $text ) = @$_;
    }
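
    Note that the m#://(.*)# match above keeps everything after the scheme, so any path comes along with the host name. If only the 'www.bbc.co.uk' part is wanted, something like the following might do. It is a rough sketch in plain Perl: host_of is a hypothetical helper name, and the pattern makes no attempt to separate out a port or a user:password@ prefix.

```perl
use strict;
use warnings;

# host_of is a hypothetical helper for illustration, not part of any module.
sub host_of {
    my ($url) = @_;
    # capture everything between "scheme://" and the next /, ? or #
    my ($host) = $url =~ m{^[a-z][a-z0-9+.\-]*://([^/?#]+)}i;
    return $host;    # undef when the URL has no scheme://host part
}

for my $url ( 'http://www.bbc.co.uk/news/index.html', 'http://www.inlandrevenue.com' ) {
    my $host = host_of($url);
    print "$url => ", ( defined $host ? $host : '(none)' ), "\n";
}
```

    For anything beyond a quick script, a dedicated URL-handling module is probably the safer route.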

    jdporter
    The 6th Rule of Perl Club is -- There is no Rule #6.