Yet Another Scraping Question

Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm scraping a website with a pretty dumb interface.

There's a "next" button at the bottom. Always, even when you've reached the end of the data (the last page just reloads and the URL doesn't change).

So I can't just follow the "next" button with Mechanize, because it will stay in the loop forever.

So, how do I check whether "next" is really "next" or just the same page again?

I thought perhaps I could just save the length of the previous page's content and check it against the current one, but of course, nothing says two pages with different content can't have the same length, especially when they're repetitive tables of data.

So, next thought: I can compare the whole of the page as two huge strings, if($last_html eq $this_html), but as the page has things like timestamps in it, this could give a false negative (the same page, reloaded, would compare as different).

Any ideas?

($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print


Replies are listed 'Best First'.
Re: Yet Another Scraping Question by bobf (Monsignor) on Apr 18, 2006 at 01:23 UTC
It sounds to me like you've already answered your own question: "the last page just reloads and the URL doesn't change". If the URL doesn't change, then it will be the same as the current page's URL, right? If there are other subtleties to this that weren't mentioned in the OP, please elaborate. Comparing URLs should be possible, and it will probably be more reliable than trying to match content (especially if there are floating ads or other dynamic content in the page).

Update: If you want to track what is being sent when you click on a link, you can use Mozilla's LiveHTTPHeaders extension or Ethereal.

HTH
Re^2: Yet Another Scraping Question by Cody Pendant (Prior) on Apr 18, 2006 at 01:45 UTC
"If there are other subtleties to this that weren't mentioned in the OP, please elaborate."

Sorry, I wasn't clear. The URL never changes. It's always "company.com/showdata.asp", and the "next" link always points to "company.com/showdata.asp?move=nextpage". Obviously something is happening in the background: the fact that I was on page 1 before is stored in a session or something similar, and the next page number, "2", is calculated in some way that I can't see in either URL.
Re^2: Yet Another Scraping Question by Cody Pendant (Prior) on Apr 18, 2006 at 02:09 UTC
Good point about the HTTP headers, but honestly, what's sent is what I showed you, and there's nothing going on this side of the server which shows me what the current or next page is.
Re: Yet Another Scraping Question by hossman (Prior) on Apr 18, 2006 at 00:59 UTC
Ask yourself this question: How do I, as a human being, know that the page I'm looking at now is the same page I was just looking at a moment ago before I clicked "next"? Once you have the answer to that question, write a function that encapsulates your answer, and use it to test each page.
Re^2: Yet Another Scraping Question by Cody Pendant (Prior) on Apr 18, 2006 at 01:39 UTC
Answer: I read the text, of course. But did I mention that the pages contain repetitive tables of data? It's entirely possible for any given row to contain the same text as a row on a previous page. But I guess you've given me an answer, because each product does have a unique ID, so I can parse the HTML of a certain row of that table and check its "&PRODUCTID=" against the same value saved from before. I was just hoping for something more ... sexy, I guess.
Re^3: Yet Another Scraping Question by hossman (Prior) on Apr 18, 2006 at 02:19 UTC
But what visual clue do you look at when reading the page that indicates to you that it's the same as the previous page? How can you tell the difference between "new data" that is "the same" as the data that was on the previous page, and "old data" that really is just a repeat of the data you've already seen? If you, as a human, can't do that, then there's no way your code will be able to.

But it sounds like you've already found your answer. You can tell if the page you are looking at is the same by looking at the PRODUCTID of each row: if there is a duplicate (or they are all duplicates) from the last page, then it's the same page.
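A minimal sketch of the PRODUCTID check described above. It assumes the IDs are alphanumeric and appear in the HTML as "PRODUCTID=<id>", and it uses the stricter "all duplicates" rule. The function names and the sample HTML are hypothetical, not taken from the site in question.

```perl
use strict;
use warnings;

# Collect every PRODUCTID value found in a page's HTML, in order of appearance.
# Assumption: IDs consist of word characters and follow "PRODUCTID=" in link URLs.
sub product_ids {
    my ($html) = @_;
    return $html =~ /PRODUCTID=(\w+)/g;
}

# A page counts as a repeat when it contains at least one ID and every ID
# on it was already present on the previous page.
sub is_repeat_page {
    my ($prev_html, $this_html) = @_;
    my %seen = map { $_ => 1 } product_ids($prev_html);
    my @ids  = product_ids($this_html);
    return 0 unless @ids;
    my $new_ids = grep { !$seen{$_} } @ids;   # count of IDs not seen before
    return $new_ids == 0 ? 1 : 0;
}

# Hypothetical sample pages: same link text, different product IDs.
my $page1 = '<a href="item.asp?x=1&PRODUCTID=A100">Widget</a>'
          . '<a href="item.asp?x=1&PRODUCTID=A101">Widget</a>';
my $page2 = '<a href="item.asp?x=1&PRODUCTID=A102">Widget</a>';

print is_repeat_page($page1, $page2) ? "repeat\n" : "new page\n";   # new page
print is_repeat_page($page2, $page2) ? "repeat\n" : "new page\n";   # repeat
```

In a WWW::Mechanize loop, one would keep the previous page's $mech->content and stop following "next" once is_repeat_page returns true.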
Re: Yet Another Scraping Question by izut (Chaplain) on Apr 18, 2006 at 09:20 UTC
Have you inspected the contents of the loaded page? You can compute an MD5 checksum of its contents; if it is the same as that of the last viewed page, you're done.

Igor 'izut' Sutton
your code, your rules.
Re^2: Yet Another Scraping Question by Cody Pendant (Prior) on Apr 19, 2006 at 05:15 UTC
Nice idea, but it would fail if I happened to have two different timestamps, wouldn't it? I could however do the checksum on the HTML table which forms the bulk of the page instead of the whole page.
Re^3: Yet Another Scraping Question by izut (Chaplain) on Apr 19, 2006 at 09:21 UTC
That's right. I think performing a checksum on the HTML table would be enough.
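A rough sketch of the table-only checksum, using the core Digest::MD5 and Encode modules. It assumes the data lives in the first <table> on the page and that tables are not nested (the non-greedy match would stop at the first closing tag). The function name and sample pages are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode_utf8);

# Return the MD5 hex digest of the first <table>...</table> block,
# or undef if the page contains no table.
sub table_digest {
    my ($html) = @_;
    my ($table) = $html =~ m{(<table\b.*?</table>)}is or return;
    # md5_hex wants bytes, so encode in case the content holds wide characters.
    return md5_hex(encode_utf8($table));
}

# Two loads of the same page: the timestamp differs, the table does not.
my $load1 = '<p>Generated 10:01:07</p><table><tr><td>A100</td></tr></table>';
my $load2 = '<p>Generated 10:01:09</p><table><tr><td>A100</td></tr></table>';

print table_digest($load1) eq table_digest($load2)
    ? "same table\n"
    : "different table\n";   # same table
```

Anything dynamic inside the table itself (a per-row timestamp, say) would still defeat this check, so it is worth looking at what the table really contains first.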
Re: Yet Another Scraping Question by polettix (Vicar) on Apr 18, 2006 at 10:34 UTC
You could try to evaluate the "distance" between the old and the new page, and stop if they're sufficiently "close". A possible approach would be to use one of the modules already available (e.g. Text::Levenshtein or Text::LevenshteinXS, but there should be others). You'll probably have to tune things a bit after getting a working setup.

Flavio
perl -ple'$_=reverse' <<<ti.xittelop@oivalf
Don't fool yourself.
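To illustrate the "distance" idea without depending on a CPAN install, here is a sketch with a hand-rolled Levenshtein function standing in for the modules named above. The pages_close helper, its 5% default threshold, and the sample strings are assumptions for illustration; the threshold in particular would need tuning against real pages.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Dynamic-programming Levenshtein distance, keeping only two rows in memory.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1,            # deletion
                           $cur[$j - 1] + 1,         # insertion
                           $prev[$j - 1] + $cost);   # substitution or match
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Treat two pages as the same when the distance, as a fraction of the
# longer page's length, is at most $threshold (arbitrary default: 0.05).
sub pages_close {
    my ($old, $new, $threshold) = @_;
    $threshold = 0.05 unless defined $threshold;
    my $longest = max(length($old), length($new)) or return 1;
    return edit_distance($old, $new) / $longest <= $threshold ? 1 : 0;
}

print edit_distance('kitten', 'sitting'), "\n";   # 3

# Same data, different timestamp: only one character differs.
my $old_page = 'Generated 10:01:07 | A100 Widget | A101 Widget | A102 Widget | A103 Widget';
my $new_page = 'Generated 10:01:09 | A100 Widget | A101 Widget | A102 Widget | A103 Widget';
print pages_close($old_page, $new_page) ? "close enough\n" : "different\n";   # close enough
```

The pure-Perl loop is quadratic in page length, so for whole pages an XS implementation, or comparing only the data table, is likely the more practical route.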
Re^2: Yet Another Scraping Question by Cody Pendant (Prior) on Apr 19, 2006 at 05:13 UTC
Also a good idea, thanks for that.