in reply to Extracting href's

Hi Scott_J, and welcome to the monastery.

You might be interested in How do I parse links out of a web page, and no doubt a super search will turn up some more material.

update... removed the regexp I put here as it was incorrect... back in a sec with a better one... andy.

continued...

Basically you'd probably be better off parsing the HTML, but if you want to avoid doing that, you could use a regexp like this:

#!/usr/bin/perl -w
use strict;
use LWP::Simple;

my $html = get('http://www.bbc.co.uk/');

# $1 is the scheme and host (empty for relative links), $2 is the rest of the URL
while ($html =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]+)"|gi) {
    print "$1, $2 \n";
}

which seems to work OK.
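
To see what the two capture groups pick up without fetching a live page, here is a minimal sketch that runs the same regexp over a small inline string. The sample markup is made up for illustration; only the regexp comes from the post above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup, used here instead of a fetched page.
my $html = <<'HTML';
<a href="http://www.bbc.co.uk/news/">News</a>
<a href="/sport/">Sport</a>
<a HREF = "index.html">Home</a>
HTML

my @links;
while ($html =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]+)"|gi) {
    # $1: scheme and host (empty string for relative links), $2: the rest
    push @links, [$1, $2];
}

printf "host=[%s] path=[%s]\n", @$_ for @links;
```

The absolute link is split into a host part and a path part, while the two relative links come back with an empty first capture.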

However, there are a couple of drawbacks:
- it's quite hard to read,
- it'll fail in certain situations, e.g. if the page contains quoted HTML as part of the actual page text, and probably in lots of other situations I haven't thought of.

But... if this is just a quick and dirty hack, and reliability isn't a big issue, then the above regexp may do what you need.
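
As a rough illustration of those failure modes, the sketch below feeds the same regexp three made-up lines: a single-quoted attribute, a link inside an HTML comment, and escaped markup that is really page text. The sample input is an assumption chosen to show the problem, not taken from a real page.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input chosen to trip up the regexp.
my $html = <<'HTML';
<a href='/single-quoted/'>a real link, but single-quoted</a>
<!-- <a href="/commented-out/">not a live link</a> -->
<p>Write &lt;a href="/quoted-in-text/"&gt; to make a link.</p>
HTML

my @paths;
while ($html =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]+)"|gi) {
    push @paths, $2;
}

# The real link is missed; the two non-links are reported.
print "$_\n" for @paths;
```

The one genuine link is skipped because it uses single quotes, while the commented-out link and the example in the page text are both picked up.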

All the best,
Andy.

update again... re-read the question and found you also wanted the link text... hold on a mo...

continued... Try this:

while ($html =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]+)"\s*>(.*?)</A>|gi) {
    print "$1, $2, $3 \n";
}
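
A small self-contained sketch of the link-text version follows, again on made-up sample markup. It assumes the closing quote of the href is followed directly by `>`, so anchors with further attributes after href would not match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup; href is the last attribute in each tag.
my $html = '<a href="http://www.bbc.co.uk/news/">News</a> '
         . '<A HREF="/sport/">Sport</A>';

my @rows;
while ($html =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]+)"\s*>(.*?)</A>|gi) {
    # $1: scheme and host, $2: rest of the URL, $3: link text
    push @rows, [$1, $2, $3];
}

print join(', ', @$_), "\n" for @rows;
```

Because `.` does not match a newline by default, link text that spans several lines would need the /s modifier as well.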

Re:^2 (nrd) Extracting attributes from anchor tags in an HTML page
by newrisedesigns (Curate) on Jan 10, 2003 at 20:17 UTC

    No offense to andye (still good advice), but parsing HTML with a regexp is a bad idea.

    Stick with a tried-and-true module. It is far less likely to break on you (usually due to bad HTML, not code) and allows for further learning.

    Take for example HTML::TokeParser:

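    use LWP::Simple;
    use HTML::TokeParser;

    # $url is assumed to hold the address of the page to scan
    my $content = get($url);
    my $ref = \$content;
    my $p = HTML::TokeParser->new($ref);
    my $token;
    while ($token = $p->get_tag("a")) {
        my $href = $token->[1]{href};            # attribute hash of the <a> tag
        my $text = $p->get_trimmed_text("/a");   # text up to the closing </a>
        print "$href => $text\n";
    }
    ## Should work...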

    This looks intimidating, and it is. :) However, by learning how to use modules like TokeParser you'll not only get a better handle on what you want to do, but you'll be learning more about Perl, as well.
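
For trying the TokeParser approach without a network connection, here is a minimal sketch that parses an inline string instead of a fetched page. The sample markup and the skipping of anchors without an href are assumptions added for illustration; HTML::TokeParser needs to be installed from CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample markup: one double-quoted href, one single-quoted
# href with an extra attribute and link text broken across two lines.
my $html = <<'HTML';
<p><a href="http://www.bbc.co.uk/">BBC</a> and
<a class="ext" href='http://www.inlandrevenue.com'>Inland
Revenue</a></p>
HTML

# Passing a scalar reference makes the parser read from the string.
my $p = HTML::TokeParser->new(\$html) or die "Cannot create parser";

my @links;
while (my $token = $p->get_tag('a')) {
    my $href = $token->[1]{href};
    next unless defined $href;               # skip anchors without an href
    my $text = $p->get_trimmed_text('/a');   # whitespace collapsed and trimmed
    push @links, [$href, $text];
}

print "$_->[0] => $_->[1]\n" for @links;
```

Both links are found regardless of quoting style or attribute order, which is where the parser earns its keep over a hand-rolled regexp.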

    Also, if you plan on doing this often, I suggest picking up a copy of Perl & LWP. It's a good resource for interacting with websites.

    John J Reiser
    newrisedesigns.com

Re: Re: Extracting href's
by Anonymous Monk on Jan 10, 2003 at 12:52 UTC
    Well Scott, I hadn't read the "How do I parse links out of a web page" node before, but I have a solution. This code is only helpful if you are trying to get just that one link. Here is my code:

    #!/usr/bin/perl -w
    use strict;

    my @getdata;
    while (<DATA>) {
        chomp;
        if ( /href=http/ ) {
            # break the line apart on words, angle brackets, quotes, "href" and "="
            @getdata = split /\w+\s+|<|>|"|href|=/;
        }
    }
    print "$getdata[9]";
    __DATA__
    Go to the BBC's website <a href=http://www.bbc.co.uk">BBC</a> or you can visit the inland revenues pages at <a href="http://www.inlandrevenue.com">Inland Revenue</a> which will give you the information you need

    Hope this helps you work it out for yourself.