First you need to get the page, so use LWP::UserAgent or LWP::Simple.
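A minimal fetch sketch, assuming a hypothetical $site_url (the URL, the poet path, and the variable names here are stand-ins, not the real site):

use strict;
use warnings;
use LWP::UserAgent;

my $site_url = 'http://www.example.com';             # hypothetical base URL
my $ua       = LWP::UserAgent->new( timeout => 30 );
my $res      = $ua->get("$site_url/poet/some_poet"); # hypothetical poet page
die "GET failed: ", $res->status_line, "\n" unless $res->is_success;
my $content  = $res->content;                        # the HTML we mine below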
Once you have the content you want to extract the links, but there are a lot of links you don't want. Fortunately, the site author makes it easy:
<!-- Main Area -->
<h1>The links you want are between the Main Area comment and this comment</h1>
<!-- other boxes can be loaded by the subroutines? bad at end? hmmph -->
I would use a regex to extract this chunk of the page.
my ($chunk) = $content =~ m#<!-- Main Area -->(.*?)<!-- other boxes can be#s;
These pages appear to be autogenerated out of a DB. The great thing about that is consistency. For example, these REs do what you want and SHOULD be reliable. YMMV.
HTML::LinkExtor is the module solution, and there is absolutely nothing wrong with it for this sort of task. However, when markup is autogenerated out of a DB, REs are reliable and can often be tailored to extract just what you want, instead of extracting all the links with LinkExtor and then having to post-process them to get what you want.

my ($poet) = $chunk =~ m!/poet/([^/ '"]+)!;
my @poems  = $chunk =~ m!(/Poems/\d+)!g;
@poems     = map { "$site_url$_" } @poems;
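For comparison, a sketch of the LinkExtor route. It works, but note the extra post-processing pass needed to winnow the links down (the grep pattern and $site_url are carried over from the REs above):

use HTML::LinkExtor;
use URI;

my @links;
my $parser = HTML::LinkExtor->new( sub {
    my ( $tag, %attr ) = @_;
    push @links, $attr{href} if $tag eq 'a' and $attr{href};
} );
$parser->parse($content);
$parser->eof;

# post-process: keep only the poem links, made absolute
my @poems = map  { URI->new_abs( $_, $site_url )->as_string }
            grep { m!/Poems/\d+! }
            @links;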
As for runtime: you sound like you want this under CGI, in which case you need to recognise that you are looking at about 2 seconds a page on average, and that is a bit harsh on your target site. Better to run it in the background and store the data in some form of DB - then you can be a good neighbour and not mount a DoS on your poets. Of course merlyn has a tutorial in his Web Techniques articles on long-running CGIs.
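A minimal sketch of that background approach, reusing the $ua and @poems built above; flat files stand in here for whatever DB you end up choosing:

# run this from cron or a shell, NOT from CGI
for my $url (@poems) {
    my $res = $ua->get($url);
    next unless $res->is_success;
    my ($id) = $url =~ m!(\d+)$!;              # poem number from the URL
    open my $fh, '>', "poem_$id.html" or die "open: $!";
    print $fh $res->content;
    close $fh;
    sleep 2;    # be a good neighbour - don't hammer your poets' site
}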
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print