in reply to Yet Another Scraping Question

Ask yourself this question:

How do I, as a human being, know that the page I'm looking at now is the same page I was just looking at a moment ago before I clicked "next"?

Once you have the answer to that question, write a function that encapsulates your answer, and use it to test each page.

Re^2: Yet Another Scraping Question
by Cody Pendant (Prior) on Apr 18, 2006 at 01:39 UTC
    Answer: I read the text, of course. But did I mention that the pages contain repetitive tables of data? It's entirely possible for any given row to contain the same text as a row on a previous page.

    But I guess you've given me an answer: each product does have a unique ID, so I can parse the HTML of a certain row of that table and check its "&PRODUCTID=" value against the one saved from the previous page.
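
A minimal sketch of that extraction step, for illustration only. It assumes the IDs show up in link query strings as "&PRODUCTID=..." (or HTML-escaped as "&amp;PRODUCTID="). The sub name product_ids, the sample rows, and the character class for the ID are all made up here; a real page may need an HTML parser rather than a regex.

```perl
use strict;
use warnings;

# Return every PRODUCTID value found in a page's HTML, in document order.
# Assumption: an ID is a run of letters, digits, "_" or "-" that follows
# "PRODUCTID=" in a query string.
sub product_ids {
    my ($html) = @_;
    return $html =~ /(?:&|&amp;|\?)PRODUCTID=([A-Za-z0-9_-]+)/g;
}

# Hypothetical sample rows: same visible text, different product IDs.
my $html = join "\n",
    '<tr><td><a href="item.cgi?CAT=7&PRODUCTID=1001">Widget</a></td></tr>',
    '<tr><td><a href="item.cgi?CAT=7&amp;PRODUCTID=1002">Widget</a></td></tr>';

print join(',', product_ids($html)), "\n";   # prints 1001,1002
```

The two rows have identical text, which is exactly the case where reading the text fails and the ID still tells them apart.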

    I was just hoping for something more ... sexy I guess.



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print

      But what visual clue do you look at when reading the page that tells you it's the same as the previous page? How can you tell the difference between "new data" that is "the same" as the data on the previous page and "old data" that really is just a repeat of the data you've already seen?

      If you, as a human, can't do that -- then there's no way your code will be able to. ... But it sounds like you've already found your answer: you can tell whether the page you are looking at is the same by looking at the PRODUCTID of each row. If some (or all) of them are duplicates from the last page, then it's the same page.
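
One way to wrap that test in a function, as a rough sketch rather than a definitive answer. The name is_repeat_page and the simulated page data are hypothetical. This version uses the "all duplicates" rule; the "any duplicate" rule would be stricter and could stop early if pages legitimately overlap.

```perl
use strict;
use warnings;

# True (1) if the current page looks like a repeat of the previous one:
# it has at least one ID, and every ID was already on the previous page.
# An empty page is reported as "not a repeat" so the caller can treat
# it as a separate case.
sub is_repeat_page {
    my ($prev_ids, $cur_ids) = @_;
    return 0 unless @$cur_ids;
    my %seen = map { $_ => 1 } @$prev_ids;
    return (grep { !$seen{$_} } @$cur_ids) ? 0 : 1;
}

# Simulated PRODUCTID lists for three fetched pages; the third is what
# you might get when "next" keeps serving the last page again.
my @pages = ([1001, 1002], [1003, 1004], [1003, 1004]);

my @prev;
my $kept = 0;
for my $ids (@pages) {
    last if is_repeat_page(\@prev, $ids);   # stop at the first repeat
    $kept++;
    @prev = @$ids;
}
print "$kept\n";   # prints 2
```

Comparing only against the immediately preceding page is a design choice; keeping every ID seen so far in the hash would also catch a site that loops back to an earlier page.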