I'm scraping a website with a pretty dumb interface.

There's a "next" button at the bottom. Always, even when you've reached the end of the data (the last page just reloads and the URL doesn't change).

So I can't just follow the "next" button with Mechanize, because it will stay in the loop forever.

So, how do I check whether "next" is really "next" or just the same page again?

I thought perhaps I could just save the length of the previous page's content and check it against the current one, but of course, nothing says two pages with different content can't have the same length, especially when they're repetitive tables of data.

So, next thought: I can compare the whole of each page as two huge strings, if ($last_html eq $this_html), but as the page has things like timestamps in it, this could give a false negative (the same page would look different).
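To make that concrete, here's a rough sketch of the comparison with the volatile bits stripped out first. The sub names (page_fingerprint, same_page) are made up for illustration, and the timestamp pattern is only a guess at HH:MM or HH:MM:SS; the real page may print times some other way, so treat this as a starting point, not a fix.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Reduce a page to a fingerprint that ignores volatile parts.
# ASSUMPTION: timestamps look like HH:MM or HH:MM:SS. Adjust the
# pattern to whatever the real page actually prints.
sub page_fingerprint {
    my ($html) = @_;
    $html =~ s/\b\d{1,2}:\d{2}(?::\d{2})?\b//g;   # drop clock times
    $html =~ s/\s+/ /g;                            # collapse whitespace runs
    # md5_hex wants bytes, so encode in case the content was decoded.
    return md5_hex(encode_utf8($html));
}

# True when two pages are identical once the volatile parts are removed.
sub same_page {
    my ($last_html, $this_html) = @_;
    return page_fingerprint($last_html) eq page_fingerprint($this_html);
}
```

In the scraping loop the idea would be to keep the previous page's content in $last_html, follow "next", and bail out with something like last if same_page($last_html, $mech->content). Comparing digests instead of lengths avoids the equal-length problem, but it still depends on my catching every volatile piece of the page.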

Any ideas?



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

In reply to Yet Another Scraping Question by Cody Pendant
