ow.
- Arbitrary remote command execution bug: a hostile page can inject shell metacharacters through a link. (e.g. <a href=";rm[tab]-rf[tab]/">)
?, & and ; are very common characters in URLs, and they are also shell metacharacters.
- Useless overhead due to recursion.
Breadth-first usually works better anyway.
- Useless overhead from shelling out to curl instead of using LWP.
- Relative links aren't handled at all.
- Tries to extract links from non-HTML documents.
- Doesn't extract all links that could reference HTML docs.
It's baffled by frames, for example.
- No throttling or robot niceties (robots.txt, a delay between requests).
- No check is done to see if a page has already been visited. (Update)
- No constraints limiting the spidering to a domain or URL path. (Update)
- The depth is checked only after extracting the links and spawning numerous instances of perl, when it could be checked beforehand. (Update)
Naïve means simple, not bad.
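Several of the points above (breadth-first traversal, a visited set, a same-host constraint, relative-link and frame handling, skipping non-HTML documents, checking the depth before enqueueing, and a throttle) can be sketched in a few dozen lines without shelling out at all. This is a minimal illustration in Python rather than the original Perl, and robots.txt handling is omitted for brevity; the fetch function is injectable, so the crawl logic itself never touches the network.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urldefrag

class LinkExtractor(HTMLParser):
    """Collects href from <a> and src from <frame>/<iframe> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.links.append(attrs["src"])

def extract_links(base_url, html):
    """Resolve relative links against the page URL; keep only http(s)."""
    parser = LinkExtractor()
    parser.feed(html)
    out = []
    for raw in parser.links:
        url, _frag = urldefrag(urljoin(base_url, raw))
        if urlsplit(url).scheme in ("http", "https"):
            out.append(url)
    return out

def crawl(start_url, fetch, max_depth=2, delay=1.0):
    """Breadth-first crawl confined to the start URL's host.
    `fetch(url)` must return (content_type, body); it is a parameter
    so tests can supply canned pages instead of real HTTP requests."""
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited_order = []
    while queue:
        url, depth = queue.popleft()
        visited_order.append(url)
        ctype, body = fetch(url)
        if "html" not in ctype:      # don't try to parse non-HTML documents
            continue
        if depth >= max_depth:       # check depth BEFORE extracting/enqueueing
            continue
        for link in extract_links(url, body):
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)       # never visit the same page twice
                queue.append((link, depth + 1))
        if delay:
            time.sleep(delay)        # crude politeness throttle
    return visited_order
```

Because the URL is never interpolated into a shell command, the `;rm -rf /` link from the first point is just another (unresolvable) URL here rather than a command.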