in reply to WWW::Mechanize find_link question.

Dear dEvNuL,

Well, I've never worked with the Mechanize library either, but if I understand your question correctly, I would personally use a pattern match to solve your problem.

If you want to capture the last link in some HTML:

# If you have a link that includes a space,
# then remove the space from the last set of brackets
($linkloc) = ($html =~ m/.*href=["']([^"'> ]+)/s);

Now, if you want to match the last URL in your document that specifically links an image (ignoring any further links that follow the image link):
($linkloc) = ($html =~ m/.*href=["']([^"'> ]+)[^>]*>\s*<img/s);

Using your HTML, $linkloc becomes url?page=2 in both cases.
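
To show how the two patterns differ, here is a minimal, self-contained sketch. The sample markup is hypothetical (the original poster's HTML is not reproduced in this post), and it deliberately adds a plain text link after the image link so the two results diverge. The variable names $last_link and $last_img_link are mine, for illustration only.

```perl
use strict;
use warnings;

# Hypothetical pager markup, assumed for illustration only.
my $html = <<'END_HTML';
<a href="url?page=1">1</a>
<a href="url?page=2"><img src="next.gif" alt="next"></a>
<a href="about.html">About</a>
END_HTML

# Last href in the document: the greedy .* (with /s, so that . also
# matches newlines) consumes everything, then backtracks to the final href.
my ($last_link) = $html =~ m/.*href=["']([^"'> ]+)/s;

# Last href whose opening tag is followed directly by an <img> tag.
my ($last_img_link) = $html =~ m/.*href=["']([^"'> ]+)[^>]*>\s*<img/s;

print "last link:       $last_link\n";      # about.html
print "last image link: $last_img_link\n";  # url?page=2
```

Both patterns rely on the markup keeping this exact shape. An unquoted href, an href inside a comment, or an attribute value containing ">" would defeat them.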

I hope this is helpful. Best,
  -Adam

Replies are listed 'Best First'.
Re^2: WWW::Mechanize find_link question.
by merlyn (Sage) on May 13, 2005 at 02:49 UTC
    /me downvoted, because using regex to match HTML is almost always wrong, unless you use a very correct regex, which you didn't.

    Please see the other answers in this thread for much better solutions.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      Dear Merlyn,

      It so happens that this particular user is trying to parse specifically formatted HTML. I would normally agree with you, but a regex is especially convenient when one is expecting data of a certain structure - this seems to meet that condition.

      Also, I'm interested in how you would modify the regex to meet your more stringent requirements. Always looking to better my ability here.

        -Adam
        Perhaps you missed the "WWW::Mechanize" in the subject? The original poster is already using Mechanize, has already had the document parsed by proper means under the hood, and was asking how to use find_link properly.

        Thus, a solution to abandon all that seems crazy. That's the craziness I was pointing out.

        -- Randal L. Schwartz, Perl hacker
        Be sure to read my standard disclaimer if this is a reply.

        The above exchange over how to parse HTML seems to be a running controversy in the Monastery.

        In "Being a heretic and going against the party line", BrowserUk criticizes "cargo cult" reliance on HTML::TokeParser, HTML::TreeBuilder, and other HTML::* modules when regexes would do fine, and argues that these modules are hard to learn and don't deserve the praise the community gives them.

        That post was in reply to "Parsing HTML tags with regex", which is a good starting thread for various methods of parsing HTML, including BrowserUk's simple regex solution, which set off the controversy after it was downvoted.