For data mining with Perl, you'd want to make use of the LWP, HTTP and HTML module families.
Using those modules, you can write a good web-crawling program that will extract pretty much anything you want from a page (e.g. links to other sites, images, etc.).
There are a few different ways to implement this, so I would take the time to read the documentation for modules such as:
- HTTP::Request
- HTTP::Response
- LWP::UserAgent
- LWP::Simple
- HTML::Parser
- URI::Heuristic
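
To give a feel for how a few of these fit together, here is a minimal sketch (not the only way to do it) that fetches one page with LWP::UserAgent and pulls link and image URLs out of it with HTML::Parser. The sub name `extract_links`, the agent string `example-spider/0.1`, and the choice to look only at `a` and `img` tags are my own assumptions for illustration; the URI module is used to turn relative links into absolute ones.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Parser;
use URI;

# Hypothetical helper: collect <a href> and <img src> values from an
# HTML string, resolved to absolute URLs against $base.
sub extract_links {
    my ($html, $base) = @_;
    my @links;
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attr) = @_;
                my $url = $tag eq 'a'   ? $attr->{href}
                        : $tag eq 'img' ? $attr->{src}
                        :                 undef;
                push @links, URI->new_abs($url, $base)->as_string
                    if defined $url;
            },
            'tagname, attr',    # handler gets the tag name and an attribute hashref
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @links;
}

# Only fetches when a URL is given on the command line,
# e.g.: perl links.pl http://example.com/
if (@ARGV) {
    my $start = $ARGV[0];
    my $ua    = LWP::UserAgent->new(agent => 'example-spider/0.1', timeout => 15);
    my $res   = $ua->get($start);
    die "GET $start failed: ", $res->status_line, "\n" unless $res->is_success;
    print "$_\n" for extract_links($res->decoded_content, $res->base);
}
```

A real crawler would also need a queue of URLs to visit and a record of the ones already seen; this only covers the fetch-and-extract step.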
From there you can explore a couple of different methods to achieve your goal. Keep in mind, however, that some sites take measures to prevent the use of spiders, so your options may be limited as to which direction you take (i.e. the easy way or the harder way).
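
One way to stay on the right side of such sites is LWP::RobotUA, which ships with LWP but isn't in my list above. It is a drop-in subclass of LWP::UserAgent that reads each site's robots.txt and waits between requests to the same host. A rough sketch, where the sub name `polite_agent`, the agent string, the contact address and the 10-second delay are all just placeholder assumptions:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# Hypothetical helper: build a user agent that obeys robots.txt and
# pauses between requests to the same server.
sub polite_agent {
    my ($name, $contact, $delay_seconds) = @_;
    my $ua = LWP::RobotUA->new($name, $contact);   # agent name and contact e-mail are required
    $ua->delay($delay_seconds / 60);               # delay() is expressed in minutes
    $ua->timeout(15);
    return $ua;
}

my $ua = polite_agent('example-spider/0.1', 'me@example.com', 10);

# Only fetches when a URL is given on the command line.
if (@ARGV) {
    my $res = $ua->get($ARGV[0]);
    if ($res->is_success) {
        print $res->decoded_content;
    }
    else {
        # URLs disallowed by robots.txt come back as an error response
        print "Skipped: ", $res->status_line, "\n";
    }
}
```

This won't get you past sites that actively block crawlers, but it keeps a well-behaved spider from hammering anyone.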
Hope that helps.
Amel - f.k.a. - kel