Re^2: Any spider framework?

You're right. I looked at the source and found this abomination:

s{<a\s+.*?href\=(.*?)>(.*?)</a>}{
...
}isgxe;

Ouch. There are so many ways that this can go wrong: "a" tags with a "name" and no href attribute, whitespace around the "=", ...

There are modules made especially to extract links from HTML, for example HTML::LinkExtor and HTML::SimpleLinkExtor. Using one of those would have been a much safer approach.
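
For illustration, a minimal sketch of link extraction with HTML::LinkExtor. The sample HTML, the base URL http://example.com/ and the @links array are invented for this example; this is not how WWW::Crawler::Lite itself does it.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Made-up input covering the cases the regexp trips over.
my $html = <<'HTML';
<a name="top">Anchor only, no href</a>
<a class="nav" href = "/about.html">Whitespace around the equals sign</a>
<!-- <a href="/hidden.html">inside a comment</a> -->
<a title="a > b" href="/next.html">A ">" inside another attribute</a>
HTML

my @links;
my $extor = HTML::LinkExtor->new(
    sub {
        # The callback receives the tag name and its link attributes only.
        my ($tag, %attr) = @_;
        push @links, "$attr{href}" if $tag eq 'a' && defined $attr{href};
    },
    'http://example.com/',    # assumed base URL, used to absolutize links
);
$extor->parse($html);
$extor->eof;

print "$_\n" for @links;
```

With this input, only the two real links should come out, both made absolute against the base URL.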

But at least this module takes "robots.txt" files into consideration, which is the polite thing to do, and probably one of the first things to go in a more naive approach. So that is good.
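
As a rough sketch of what honouring robots.txt involves, here is the CPAN module WWW::RobotRules applied to a robots.txt body supplied as a string. Whether WWW::Crawler::Lite uses this module internally is not confirmed here; the agent name ExampleBot/1.0, the URLs and the rules are all made up.

```perl
use strict;
use warnings;
use WWW::RobotRules;

# Hypothetical crawler name; a real crawler would use its own agent string.
my $rules = WWW::RobotRules->new('ExampleBot/1.0');

# Invented robots.txt content; normally this would be fetched from the site.
my $robots_txt = <<'END';
User-agent: *
Disallow: /private/
END

$rules->parse('http://example.com/robots.txt', $robots_txt);

for my $url ('http://example.com/index.html',
             'http://example.com/private/data.html') {
    # allowed() returns 1 for permitted, 0 for forbidden.
    my $verdict = $rules->allowed($url) == 1 ? 'allowed' : 'disallowed';
    printf "%-40s %s\n", $url, $verdict;
}
```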

Replies are listed 'Best First'.
Re^3: Any spider framework? by tobyink (Canon) on Jan 06, 2012 at 12:51 UTC
In the case of `<a name="foo">` it simply won't match, as the regexp includes href. And you wouldn't want it to match, as it's not a link. Whitespace around the equals sign (which is rare, but valid) is more problematic. There are other edge cases which behave differently to how you might want them to as well: note that the first subcapture allows ">" to occur within it. But in practice, it's probably good enough to work for the majority of people. The author may well accept a patch to parse the page properly using HTML::Parser, given that the module already has a dependency on that module (indirectly, via LWP::UserAgent). Or if you can't wait for a new fixed version to be released, just subclass it; it's only really that one method that's in major need of fixing.
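
To make those edge cases concrete, here is a small sketch that runs the pattern quoted at the top of the thread against a few hand-written tags. The sample HTML and the probe() helper are made up for illustration; the replacement body elided in the original is left out, so only the captures are inspected.

```perl
use strict;
use warnings;

# Pattern quoted at the top of the thread. The elided replacement body is
# omitted, and /g, /x and /e are dropped because only one match is inspected.
my $link_re = qr{<a\s+.*?href\=(.*?)>(.*?)</a>}is;

# Hypothetical helper: returns [href capture, text capture], or undef.
sub probe {
    my ($html) = @_;
    return $html =~ $link_re ? [ $1, $2 ] : undef;
}

my @samples = (
    '<a href="/ok.html">ok</a>',                # plain link
    '<a name="foo">foo</a>',                    # anchor without href
    '<a href = "/spaced.html">spaced</a>',      # whitespace around "="
    '<a href="/x.html" title="a > b">x</a>',    # ">" inside an attribute
);

for my $html (@samples) {
    my $m = probe($html);
    print "$html\n  ", ($m ? "href capture: [$m->[0]]" : 'no match'), "\n";
}
```

The plain link matches, though the capture still carries its quotes. The name-only anchor and the spaced-out href do not match at all. In the last sample the href capture stops at the ">" inside the title value, so it also contains the start of the title attribute.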
Re^4: Any spider framework? by jdrago999 (Pilgrim) on Jan 08, 2012 at 04:54 UTC
As the author of WWW::Crawler::Lite, I am also appalled at the use of that regexp for URL detection! (What was I thinking?) I am quite pressed for time at the moment, but I will put the module on GitHub and re-release it with the patches/updates already suggested on RT. FWIW I use this module in several places (and have for some time now). While there are perhaps some more "robust" spiders/crawlers out there, I wasn't able to find one as simple to use and understand as W:C:L. Once the GitHub + PAUSE uploads are completed, I'll re-post here. Thanks!
Re^4: Any spider framework? by bart (Canon) on Jan 10, 2012 at 08:07 UTC
"In the case of `<a name="foo">` it simply won't match, as the regexp includes href."
And what makes you think the regex would limit itself to a single tag? In your example, the "`<a`" could be matched while the "href=" would be much further down in the document. In fact, there is no guarantee that this string is a tag attribute: it could just be in plain HTML text ("PCDATA"), JavaScript code, or even in HTML comments. To be reliable, a parser (actually just a lexer; it could be regex based) should extract whole tags, and you should then test each on its own.
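
A rough sketch of that two-step idea, using only core Perl: first lex out whole `<a ...>` start tags, then look for href inside each tag on its own. The function name extract_hrefs and the sample HTML are made up. This is a heuristic illustration rather than a full HTML parser: it strips comments and script blocks crudely and does not handle CDATA or unbalanced quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the href values of all <a> start tags.
sub extract_hrefs {
    my ($html) = @_;

    # Crudely drop comments and script blocks so their contents are ignored.
    $html =~ s/<!--.*?-->//gs;
    $html =~ s{<script\b.*?</script>}{}gis;

    my @hrefs;
    # Step 1: lex one whole start tag; quoted values may contain ">".
    while ($html =~ m{<a\s+((?:[^>"']|"[^"]*"|'[^']*')*)>}gi) {
        my $attrs = $1;
        # Step 2: test this tag alone for an href attribute.
        if ($attrs =~ m{(?:^|\s)href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))}i) {
            push @hrefs, defined $1 ? $1 : defined $2 ? $2 : $3;
        }
    }
    return @hrefs;
}

my $html = <<'HTML';
<a name="top">Top</a>
<p>Set href=nothing here > and later</p>
<!-- <a href="/hidden.html">hidden</a> -->
<a class="x" href = '/spaced.html'>spaced</a>
<a title="a > b" href="/real.html">real</a>
HTML

print "$_\n" for extract_hrefs($html);
```

Because each tag is examined in isolation, the "href=" in plain text and the link inside the comment are ignored, and the name-only anchor cannot borrow an href from further down the document.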
Re^4: Any spider framework? by jdrago999 (Pilgrim) on Jan 08, 2012 at 06:40 UTC
OK! As promised, the patches/updates/POD have been applied, GitHub now hosts the code, and I've put the newest release on GitHub at https://github.com/jdrago999/WWW-Crawler-Lite. Thanks everyone for your suggestions and time. Now you can get the HTML::LinkExtor version of link parsing by specifying 'link_parser => "HTML::LinkExtor"' in the constructor. Otherwise you get the 'default' (original, regexp-based) way. Maybe this could be changed, actually, to use something slick like Web::Query to get at that information (which, for me, was the whole point).