Re: unix perl web spider
by ikegami (Patriarch) on Jan 12, 2008 at 19:24 UTC

    ow.

    • Arbitrary remote code execution bug. (e.g. <a href=";rm[tab]-rf[tab]/">)
      ?, & and ; are very common characters in URLs, and they are also shell metacharacters.
    • Useless overhead due to recursion.
      Breadth-first usually works better anyway; see the sketch below.
    • Useless overhead from using curl instead of LWP.
    • Relative links aren't handled at all.
    • Tries to extract links from non-HTML documents.
    • Doesn't extract all links that could reference HTML docs.
      It's baffled by frames, for example.
    • No throttling or robot niceties.
    • No check is done to see if a page has already been visited. (Update)
    • No constraints limiting the spidering to a domain or URL path. (Update)
    • It checks the depth only after extracting the links and spawning numerous instances of perl, when it could do so beforehand. (Update)

    Naïve means simple, not bad.
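
    A rough, untested sketch of what I have in mind, assuming LWP::UserAgent, HTML::LinkExtor and URI are installed (the one-second delay, the single-host restriction and the user-agent string are just illustrative choices):

        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::LinkExtor;
        use URI;

        my ( $max_depth, $start ) = @ARGV;
        die "usage: $0 max_depth start_url\n"
            unless defined $start
                && $max_depth =~ /^\d+$/
                && $start =~ m{^https?://}i;

        # LWP::RobotUA could be used instead to respect robots.txt
        my $ua   = LWP::UserAgent->new( agent => 'toy-spider/0.1' );
        my $host = URI->new($start)->host;

        my %seen  = ( $start => 1 );
        my @queue = ( [ $start, 0 ] );          # breadth-first: [ url, depth ]

        while ( my $item = shift @queue ) {
            my ( $url, $depth ) = @$item;

            my $resp = $ua->get($url);
            next unless $resp->is_success;
            print "$url\n";

            # don't parse if we can't go deeper, or if it isn't HTML
            next if $depth >= $max_depth;
            next unless $resp->content_type eq 'text/html';

            my $html = $resp->decoded_content;
            next unless defined $html;

            # passing the base URL makes links() return absolute URIs
            my $extor = HTML::LinkExtor->new( undef, $url );
            $extor->parse($html);

            for my $link ( $extor->links ) {
                my ( $tag, %attr ) = @$link;
                next unless $tag =~ /^(?:a|frame|iframe)$/;
                my $href = $attr{href} || $attr{src} or next;
                my $uri  = URI->new($href);
                next unless $uri->scheme && $uri->scheme =~ /^https?$/;
                next unless $uri->host eq $host;    # stay on one site
                $uri->fragment(undef);
                next if $seen{"$uri"}++;            # don't visit a page twice
                push @queue, [ "$uri", $depth + 1 ];
            }

            sleep 1;                                # crude throttling
        }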

Re: unix perl web spider
by graff (Chancellor) on Jan 12, 2008 at 19:24 UTC
    Your script actually creates a lot of trouble (apart from the fact that this wheel has already been invented numerous times). In no particular order:
    • When the user does not provide the args that you need, it's a good idea to "die" with a helpful message, to let the user know what is expected.
    • You should be more careful with your ARGV checks: is the first arg numeric? does the second arg look like a usable url? (See the sketch below.)
    • Your "if" condition seems to imply that you want to allow for an "undefined" value of $depth_level, but there is no way for that value to be undefined. You need some other strategy for "unlimited depth".
    • The biggest problem is that you extract href strings from a web page, and pass these strings to a shell command without any sort of special precaution; that might work for a few very simple web sites, but quite often you will get unintended results when you pass "$new_call" to a subshell in backticks.
    You really should be using LWP::Simple rather than "curl" in backticks, so that you don't need to worry about strange or dangerous results from "shell-magic" characters in the urls. In any case, it'll be more efficient than using sub-shells (esp. nested ones).
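
    For instance, a rough (untested) sketch of the argument checks and the fetch; it keeps the $depth_level name from the original, uses $url as an illustrative name for the second argument, and treating a depth of 0 as "unlimited" is just one possible convention:

        use strict;
        use warnings;
        use LWP::Simple qw(get);

        # one possible convention: a depth of 0 could mean "unlimited"
        my ( $depth_level, $url ) = @ARGV;

        die "Usage: $0 <depth> <start-url>\n"
            unless defined $url
                && $depth_level =~ /^\d+$/
                && $url =~ m{^https?://}i;

        # no backticks, no subshell, no shell metacharacters to worry about
        my $page = get($url);
        defined $page
            or die "Couldn't fetch $url\n";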

    You should also get acquainted with the concept of recursive subroutine calls, so that you have just one subroutine in one process that handles all the depth levels (instead of a separate sub process for each level).
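
    Something along these lines (untested); crawl() and extract_links() are just illustrative names, and the link-extraction stub would need to be filled in with a real HTML parser such as HTML::LinkExtor:

        use strict;
        use warnings;
        use LWP::Simple qw(get);

        # stub: replace with real link extraction (e.g. HTML::LinkExtor plus URI)
        sub extract_links { return () }

        sub crawl {
            my ( $url, $depth ) = @_;
            return if $depth < 0;               # depth allowance used up

            my $page = get($url);
            return unless defined $page;
            print "$url\n";

            for my $link ( extract_links( $page, $url ) ) {
                crawl( $link, $depth - 1 );     # same sub, same process, one level down
            }
        }

        @ARGV == 2 or die "Usage: $0 <depth> <url>\n";
        crawl( $ARGV[1], $ARGV[0] );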

    In any case, I strongly advise that you do not use the code as originally posted, especially not in a unix shell.

Re: unix perl web spider
by peter (Sexton) on Feb 09, 2008 at 19:47 UTC

    Of course wget -m could do the same thing better (and that's only one 'standard' unix tool).

    Peter Stuifzand