in reply to Extracting href's
You might be interested in "How do I parse links out of a web page", and no doubt a Super Search will turn up some more material.
update... removed the regexp I put here as it was incorrect... back in a sec with a better one... andy.
continued...
Basically you'd probably be better parsing the HTML, but if you want to avoid doing that then you could use a regexp like this:
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $html = get('http://www.bbc.co.uk/');
    while ($html =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]+)"|gi) {
        print "$1, $2 \n";
    }

which seems to work OK.
However, there are a couple of drawbacks:
- it's quite hard to read,
- it'll fail in certain situations, e.g. if the page contains quoted html as part of the actual page text, and probably in lots of other situations I haven't thought of.
But... if this is just a quick and dirty hack, and reliability isn't a big issue, then the above regexp may do what you need.
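To make those drawbacks a bit more concrete, here's a rough sketch that runs the same regexp over a small inline string instead of a fetched page. The sample markup and the `extract_hrefs` helper are made up purely for illustration. It shows what ends up in `$1` and `$2`, and two ways the pattern can go wrong: a single-quoted attribute is silently skipped, and a link inside an HTML comment is reported as if it were live.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample markup, invented for this illustration.
my $html = <<'HTML';
<a href="http://www.example.com/news/index.html">News</a>
<a href="/about.html">About</a>
<a href='/single.html'>Single-quoted</a>
<!-- <a href="/old.html">Old</a> -->
HTML

# Collect [scheme-and-host, path] pairs using the regexp from the post.
# extract_hrefs is a helper name chosen for this sketch only.
sub extract_hrefs {
    my ($text) = @_;
    my @found;
    while ($text =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]+)"|gi) {
        push @found, [$1, $2];
    }
    return @found;
}

# The single-quoted link is missed; the commented-out link is picked up.
print "'$_->[0]', '$_->[1]'\n" for extract_hrefs($html);
```

For relative links the first capture is just an empty string, so it's worth checking for that before gluing the two parts back together.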
All the best,
Andy.
update again... re-read the question and found you also wanted the link text... hold on a mo...
continued... Try this:
    while ($html =~ m|href\s*=\s*"((?:[^/]+://[^"/]+)?)/?([^"]+)"\s*>(.*?)</A>|gi) {
        print "$1, $2, $3 \n";
    }
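For comparison, here's roughly what the "parse the HTML instead" route might look like. This is only a sketch, assuming HTML::TokeParser (part of the HTML-Parser distribution on CPAN) is installed; the sample markup and the `extract_links` helper are invented for the example. It pulls out both the href and the link text, and since it works on tokens rather than raw text it copes with single-quoted attributes and ignores links inside comments.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;    # from the HTML-Parser distribution on CPAN

# Return a list of [href, link text] pairs found in an HTML string.
# extract_links is a helper name chosen for this sketch only.
sub extract_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";
    my @links;
    # get_tag('a') skips ahead to the next <a ...> start tag
    while (my $tag = $parser->get_tag('a')) {
        my $href = $tag->[1]{href};      # attribute hash is element 1
        next unless defined $href;       # e.g. <a name="..."> anchors
        # Text up to the closing </a>, with whitespace tidied
        my $text = $parser->get_trimmed_text('/a');
        push @links, [$href, $text];
    }
    return @links;
}

# Hypothetical sample markup, invented for this illustration.
my $sample = <<'HTML';
<a href="http://www.example.com/news/index.html">News</a>
<a href='/single.html'>Single-quoted</a>
<!-- <a href="/old.html">Old</a> -->
HTML

print "$_->[0], $_->[1]\n" for extract_links($sample);
```

To run it against a live page you could feed it the string returned by LWP::Simple's `get`, as in the script above, instead of the inline sample.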