I'm trying to create a web crawler and I'd be interested in any information you monks have on creating them. I know there are plenty of ways these could be misused, but in the end it'll be set up so it fetches one page per fixed interval to go easy on the server.
What I'm trying to do is put in a $url and have it scan that page for links. Then, if it finds any, it'll go to the first link and keep branching off from there until every page it can reach by following links has been scanned.
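To make the traversal concrete, here is a minimal sketch of the bookkeeping only, not a working crawler. It assumes a breadth-first order (a queue of pages still to visit plus a hash of pages already seen), which is just one way to do the branching. The %links_on table and the crawl_order sub are hypothetical stand-ins for real fetching and link extraction.

```perl
use strict;
use warnings;

# Hypothetical stand-in for "fetch a page and extract its links":
# a hard-coded table mapping each page to the links found on it.
my %links_on = (
    '/index.html' => [ '/a.html', '/b.html' ],
    '/a.html'     => [ '/b.html', '/c.html' ],
    '/b.html'     => [ '/index.html' ],
    '/c.html'     => [],
);

# Return every page reachable from $start, each exactly once, breadth first.
sub crawl_order {
    my ($start, $links) = @_;
    my @queue = ($start);         # pages still to visit
    my %seen  = ($start => 1);    # pages already queued, so loops terminate
    my @order;
    while (defined(my $page = shift @queue)) {
        push @order, $page;
        # A real crawler would fetch $page here and sleep between requests.
        for my $link (@{ $links->{$page} || [] }) {
            next if $seen{$link}++;    # skip anything queued before
            push @queue, $link;
        }
    }
    return @order;
}

print join("\n", crawl_order('/index.html', \%links_on)), "\n";
```

The %seen hash is what stops the crawl from looping forever when pages link back to each other, as /b.html does to /index.html here.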
I could probably manage using LWP::Simple or maybe LWP::UserAgent to scrape the main page for the links, and possibly do the same for the first link after that. But when it comes to branching out to get all the links from each page (like a tree of links), I have NO idea where to begin.
And using regexes would probably be a pain because webmasters don't always use FULL URLs like they should. Then there are dynamic URLs like /cgi-bin/script.cgi?param=12&name=test to deal with.
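To illustrate the relative-URL problem, here is a small sketch of resolving an href against the URL of the page it was found on. It assumes the CPAN URI module is installed (LWP depends on it); the absolute_url sub and the example.com addresses are made up for the example.

```perl
use strict;
use warnings;
use URI;    # assumption: the CPAN URI module is available

# Resolve an href as written in a page against that page's own URL.
sub absolute_url {
    my ($href, $base) = @_;
    return URI->new_abs($href, $base)->as_string;
}

my $base = 'http://www.example.com/docs/index.html';    # hypothetical page URL
my @hrefs = (
    '/cgi-bin/script.cgi?param=12&name=test',    # site-relative, with a query string
    'other.html',                                # relative to the current directory
    '../top.html',                               # relative to the parent directory
    'http://www.example.org/full.html',          # already absolute
);
print absolute_url($_, $base), "\n" for @hrefs;
```

With the hrefs resolved this way, the query string on a dynamic URL is carried along as part of the address, so it should need no special handling.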
Are there modules I could use to do this type of thing? Any advice would be much appreciated.