First you need to fetch the page, so use LWP::UserAgent or LWP::Simple.
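A minimal sketch of the fetch step with LWP::UserAgent. The sub name fetch_page, the agent string and the 30 second timeout are my own placeholders, not anything the site requires, so adjust to taste.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# fetch_page is a hypothetical helper name; timeout and agent are assumptions.
sub fetch_page {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(timeout => 30, agent => 'poem-fetcher/0.1');
    my $response = $ua->get($url);

    # Die with the HTTP status line so the caller can see what went wrong.
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;

    # decoded_content undoes any Content-Encoding and charset encoding.
    return $response->decoded_content;
}
```

You would call it as my $content = fetch_page($url); and wrap that in an eval if you want to carry on after a failed request.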

Once you have the content, you want to extract the links, but there are a lot of links you don't want. Fortunately the site author makes it easy:

<!-- Main Area -->
<h1>The links you want are between the Main Area Comment and this comment</h1>
<!-- other boxes can be loaded by the subroutines? bad at end? hmmph -->

I would use a regex to extract this chunk of the page.

my ($chunk) = $content =~ m#<!-- Main Area -->(.*?)<!-- other boxes can be#s;

These pages appear to be autogenerated out of a DB. The great thing about that is consistency. For example, these REs do what you want and SHOULD be reliable. YMMV.

my ($poet) = $chunk =~ m!/poet/([^/ '"]+)!;
my @poems = $chunk =~ m!(/Poems/\d+)!g;
@poems = map { "$site_url$_" } @poems;

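To see the pieces working together, here is a rough sketch that wraps the regexes above in one sub. The name extract_links and the choice to return an empty list when the markers are missing are my assumptions; the patterns themselves are the ones from this post.

```perl
use strict;
use warnings;

# extract_links is a hypothetical wrapper around the regexes shown above.
# Returns ($poet, @poem_urls), or an empty list if the markers are not found.
sub extract_links {
    my ($content, $site_url) = @_;

    # Grab only the text between the two marker comments (/s lets . match newlines).
    my ($chunk) = $content =~ m#<!-- Main Area -->(.*?)<!-- other boxes can be#s;
    return unless defined $chunk;

    # Poet name is the path segment after /poet/, if one is present.
    my ($poet) = $chunk =~ m!/poet/([^/ '"]+)!;

    # Every /Poems/<number> path, made absolute with the site URL.
    my @poems = map { $site_url . $_ } $chunk =~ m!(/Poems/\d+)!g;

    return ($poet, @poems);
}
```

Links outside the two comments (navigation, side boxes) never get looked at, which is the whole point of cutting out the chunk first.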
HTML::LinkExtor is the module solution. There is absolutely nothing wrong with it for this sort of task. However, when markup is autogenerated out of a DB, REs are reliable and can often be tailored to extract just what you want, instead of extracting all the links with LinkExtor and then having to post-process them to get what you want.
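For comparison, a sketch of the HTML::LinkExtor route with the post-processing step included. The sub name poem_links is made up for illustration, and the filter reuses the /Poems/ pattern from above.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# poem_links is a hypothetical helper: extract every link, then filter.
sub poem_links {
    my ($html) = @_;

    # With no callback, the parser collects links for retrieval via ->links.
    my $parser = HTML::LinkExtor->new;
    $parser->parse($html);
    $parser->eof;

    my @wanted;
    for my $link ($parser->links) {
        # Each entry is [ $tag, attr_name => url, ... ].
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};
        push @wanted, $attr{href} if $attr{href} =~ m!/Poems/\d+!;
    }
    return @wanted;
}
```

Note that this pulls every link on the page first, so you still need the chunk extraction (or some other filter) if you only want the Main Area links.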

As for runtime, it sounds like you want this under CGI, in which case you need to recognise that you are looking at about 2 seconds a page on average, and fetching on every request is a bit harsh on your target site. Better to run it in the background and store the data in some form of DB. Then you can be a good neighbour and not do a DoS on your poets. Of course merlyn has a tutorial on long-running CGIs in his Web Techniques articles.
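One way the background idea could look, as a sketch only. I am using a Storable file as a stand-in for "some form of DB", and the names refresh_cache and cached_page, the code-ref fetcher and the delay argument are all assumptions for illustration.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Run this from cron or by hand, never from the CGI itself.
# $fetch is a code ref that takes a URL and returns the page content.
sub refresh_cache {
    my ($cache_file, $fetch, $delay, @urls) = @_;
    my $cache = -e $cache_file ? retrieve($cache_file) : {};

    for my $url (@urls) {
        next if exists $cache->{$url};    # already stored, no request needed
        $cache->{$url} = $fetch->($url);
        sleep $delay if $delay;           # pause between requests to be polite
    }

    store($cache, $cache_file);
    return $cache;
}

# The CGI side: a cheap local read, no network traffic at request time.
sub cached_page {
    my ($cache_file, $url) = @_;
    return unless -e $cache_file;
    return retrieve($cache_file)->{$url};
}
```

The CGI then only ever calls cached_page, so a visitor hitting reload does not translate into more hits on the target site. A real DB (DBI plus whatever you have handy) would be the better choice once the data grows.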

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print


In reply to Re: Where to start? by tachyon
in thread Where to start? by sulfericacid
