You have a number of problems here. First of all, what if the submitted URL is the same page as an existing one, but the new one ends in / and the stored one ends in /index.html? Or .php, or .htm, or whatever. If a URL ends in /index.*, you need to compare it to the / version and see if they match. Second, what if the URL starts with a subdomain other than www, or has no subdomain at all? You need to check it against the regular www. URL. The best way to do both checks is to store an MD5 hash of the contents of each page. If the hashes don't match, download the page that is already in the database again, to make sure it wasn't simply updated since you hashed it. Third, what if there are pages with duplicate content but different URLs? How do you handle those?
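As a rough illustration of the first two checks, here is a minimal sketch in Python (the same idea carries over to Perl). Everything named here is hypothetical: `candidate_key`, `content_hash`, `same_page`, and the `refetch` callback are made up for the example, and treating any `index.<ext>` as the directory default and stripping only a leading `www.` are assumptions, not rules every server follows. The key only finds candidates; the hash comparison is what decides.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Assumption: any "index.<ext>" file name is the directory's default page.
INDEX_RE = re.compile(r"/index\.[A-Za-z0-9]+$")


def candidate_key(url: str) -> str:
    """Return a lookup key that groups URLs likely to point at the same page."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()  # scheme and port are ignored here
    # Assumption: "www.example.com" and "example.com" may be the same site.
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    # Fold /index.html, /index.php, ... into the bare directory URL.
    path = INDEX_RE.sub("/", path)
    key = host + path
    if parts.query:
        key += "?" + parts.query
    return key


def content_hash(body: bytes) -> str:
    """MD5 hex digest of the raw page contents."""
    return hashlib.md5(body).hexdigest()


def same_page(new_body: bytes, stored_hash: str, refetch) -> bool:
    """Compare a newly fetched page against a stored hash.

    `refetch` is a hypothetical callable that downloads the stored URL
    again and returns its current contents as bytes.
    """
    new_hash = content_hash(new_body)
    if new_hash == stored_hash:
        return True
    # Mismatch: the stored page may have been updated since it was hashed,
    # so hash a fresh copy before deciding the two pages differ.
    return content_hash(refetch()) == new_hash
```

In use, you would look up `candidate_key(url)` in the database and call `same_page` only when a candidate turns up. An index on the stored hash would also surface pages with identical content under unrelated URLs, but what to do with those still depends on what the data is for.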
The big question, of course, is what exactly you're trying to do with the data once you have it in your database. It's hard to work out the how without knowing the why.