ow.
- Arbitrary remote command execution bug: a hostile page can inject shell metacharacters through a link. (e.g. <a href=";rm[tab]-rf[tab]/">)
?, & and ; are very common characters in URLs, and they are also shell metacharacters.
- Useless overhead due to recursion.
Breadth-first usually works better anyway.
- Useless overhead from shelling out to curl instead of using LWP.
- Relative links aren't handled at all.
- Tries to extract links from non-HTML documents.
- Doesn't extract all links that could reference HTML docs.
It's baffled by frames, for example.
- No throttling or robot niceties (robots.txt, a delay between requests).
- No check is done to see if a page has already been visited. (Update)
- No constraints limiting the spidering to a domain or URL path. (Update)
- The depth is checked only after extracting the links and spawning numerous instances of perl, when it could be checked beforehand. (Update)
Naïve means simple, not bad.
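Several of the points above (breadth-first traversal, a visited set, a same-host constraint, relative-link and frame handling, skipping non-HTML documents, checking the depth before enqueueing, and a throttle) can be sketched in a few dozen lines without shelling out at all. This is a minimal illustration in Python rather than the original Perl, and robots.txt handling is omitted for brevity; the fetch function is injectable, so the crawl logic itself never touches the network.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urldefrag

class LinkExtractor(HTMLParser):
    """Collects href from <a> and src from <frame>/<iframe> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.links.append(attrs["src"])

def extract_links(base_url, html):
    """Resolve relative links against the page URL; keep only http(s)."""
    parser = LinkExtractor()
    parser.feed(html)
    out = []
    for raw in parser.links:
        url, _frag = urldefrag(urljoin(base_url, raw))
        if urlsplit(url).scheme in ("http", "https"):
            out.append(url)
    return out

def crawl(start_url, fetch, max_depth=2, delay=1.0):
    """Breadth-first crawl confined to the start URL's host.
    `fetch(url)` must return (content_type, body); it is a parameter
    so tests can supply canned pages instead of real HTTP requests."""
    host = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited_order = []
    while queue:
        url, depth = queue.popleft()
        visited_order.append(url)
        ctype, body = fetch(url)
        if "html" not in ctype:      # don't try to parse non-HTML documents
            continue
        if depth >= max_depth:       # check depth BEFORE extracting/enqueueing
            continue
        for link in extract_links(url, body):
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)       # never visit the same page twice
                queue.append((link, depth + 1))
        if delay:
            time.sleep(delay)        # crude politeness throttle
    return visited_order
```

Because the URL is never interpolated into a shell command, the `;rm -rf /` link from the first point is just another (unresolvable) URL here rather than a command.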