in reply to INSERT or UPDATE, but only when unique

You have a number of problems here. First, what if the submitted URL points to the same page as an existing one, but one ends in / and the other in /index.html (or .php, or .htm, or whatever)? If a URL ends in /index.*, you need to compare it against the / version and see whether they match. Second, what if the URL starts with a subdomain other than www, or has no subdomain at all? You need to check it against the regular www. URL. The best way to handle both cases is to store an MD5 hash of the contents of each page; if the hashes don't match, download the page already in the database again to make sure it wasn't just updated since you hashed it. Third, what if there are pages with duplicate content but different URLs? How do you handle those?

The big question, of course, is what exactly you're trying to do with your data once you have it in your database. It's hard to determine the how without knowing the why.
