in reply to Re: Any spider framework?
in thread Any spider framework?
You're right. I looked at the source and found this abomination:
    s{<a\s+.*?href\=(.*?)>(.*?)</a>}{ ... }isgxe;
Ouch. There are so many ways that this can go wrong: "a" tags with a "name" and no href attribute, whitespace around the "=", ...
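To make a few of those failure modes concrete, here is a small sketch that runs the same pattern as a plain match instead of a substitution. The helper name try_match and the sample HTML snippets are made up for illustration; only the pattern itself comes from the module.

```perl
use strict;
use warnings;

# Same pattern as quoted above, used as a match instead of a substitution.
my $link_re = qr{<a\s+.*?href\=(.*?)>(.*?)</a>}isx;

# Hypothetical helper: returns a hashref describing the match, or nothing.
sub try_match {
    my ($html) = @_;
    if ($html =~ $link_re) {
        return {
            whole => substr($html, $-[0], $+[0] - $-[0]),  # everything the pattern consumed
            href  => $1,                                   # what ended up in the "href" capture
            text  => $2,                                   # what ended up in the link text capture
        };
    }
    return;
}

# Made-up sample inputs, one per failure mode.
my @samples = (
    [ 'named anchor first', '<a name="top">Top</a> <a href="/x">X</a>' ],
    [ 'spaces around "="',  '<a href = "/x">X</a>' ],
    [ 'extra attribute',    '<a href="/x" class="y">X</a>' ],
);

for my $sample (@samples) {
    my ($label, $html) = @$sample;
    if (my $m = try_match($html)) {
        print "$label: consumed [$m->{whole}], href capture [$m->{href}]\n";
    }
    else {
        print "$label: no match\n";
    }
}
```

In the first sample the match starts at the named anchor and runs through to the end of the second tag, so a substitution would swallow both. The second sample is not matched at all, and in the third the "href" capture also picks up the quotes and the class attribute.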
There are modules made especially to extract links from HTML, for example HTML::LinkExtor and HTML::SimpleLinkExtor. Using one of those would have been a much safer approach.
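As a rough idea of what that could look like, here is a minimal sketch using HTML::LinkExtor. It assumes the HTML-Parser and URI distributions are installed; the helper name extract_hrefs, the sample HTML and the base URL are made up for illustration, and this is not how the module under discussion does it.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return the href of every <a> tag in $html,
# made absolute against $base.
sub extract_hrefs {
    my ($html, $base) = @_;

    # No callback given, so the parser collects the links for us.
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;

    my @hrefs;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;    # e.g. ('a', href => URI object)
        next unless $tag eq 'a' && defined $attr{href};
        push @hrefs, "$attr{href}";   # stringify the URI object
    }
    return @hrefs;
}

# Made-up input covering the cases that trip up the regex.
my $html = <<'HTML';
<a name="top">Top</a>
<a href = "/x">X</a>
<a href="/y" class="y">Y</a>
HTML

print "$_\n" for extract_hrefs($html, 'http://example.org/');
```

The named anchor simply yields no link, and whitespace around the "=" or extra attributes are the parser's problem, not yours.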
But at least this module takes "robots.txt" files into consideration, which is the polite thing to do, and probably one of the first things to go in a more naive approach. So that is good.
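For anyone rolling their own spider, a robots.txt check can be sketched with WWW::RobotRules. I have not checked how the module under discussion implements it, so treat this as one possible approach; the agent name, URLs and robots.txt content below are made up, and a real spider would fetch the file over HTTP first.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Made-up robots.txt content; normally fetched from $robots_url.
my $robots_url = 'http://example.org/robots.txt';
my $robots_txt = <<'TXT';
User-agent: *
Disallow: /private/
TXT

# Hypothetical agent name for this sketch.
my $rules = WWW::RobotRules->new('ExampleSpider/0.1');
$rules->parse($robots_url, $robots_txt);

for my $url ('http://example.org/index.html',
             'http://example.org/private/data.html') {
    # allowed() is only meaningful once the host's robots.txt has been parsed.
    print $url, ': ', ($rules->allowed($url) ? 'fetch' : 'skip'), "\n";
}
```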