This is the easy part.
The hard part is that the pages the crawler visits may present the link in any of a variety of ways. It may be in a frame, it may be generated by JavaScript, it may sit behind a meta-refresh, or it may be rendered on the page in some unforeseen way that a browser knows how to handle. In short, I need my program to look at the final rendered HTML source just as a browser would.
To illustrate: if I have LWP go to http://www.foo.com and foo.com uses frames, then I need to check the source of each frame, not just the frameset source (sketched below).
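For concreteness, here is roughly what a plain LWP fetch hands back from a frameset page (foo.com and the frame file names are just placeholders):

    use LWP::Simple;
    # get() returns only the frameset wrapper, something like
    #   <frameset cols="20%,80%">
    #     <frame src="nav.html"> <frame src="main.html">
    #   </frameset>
    # -- none of the HTML inside nav.html or main.html that the visitor
    # actually sees, so those frame sources are what really need checking.
    my $frameset_only = get('http://www.foo.com');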
I have some ideas of how to do this while following links and adding exceptions for JavaScript, frames, meta-refresh, and any others I can come up with. But browsers already handle all of those cases, so if my program runs as if it were a browser, I don't have to add exceptions as they come up.
Does anyone have an easy way to do this that goes beyond, but may even include, HTML::Parser and LWP? I have researched this and know I can do it with an HTML::Parser/LWP combo where I follow certain links, but as I said, if it can act like a browser, I don't need to worry about following links to get the source for the page that the user would eventually see.
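To show what I mean by that combo, here is a rough sketch of the approach: fetch a page, pull out frame, iframe, and meta-refresh targets, and fetch those too. The fetch_rendered_sources name, the user-agent string, and the URL are made up, and JavaScript-generated links are exactly the case this does not handle:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::Parser;
    use URI;

    my $ua = LWP::UserAgent->new( agent => 'my-crawler/0.1' );

    # Fetch a URL, then chase any frame, iframe, or meta-refresh targets,
    # returning the HTML of every document a browser would have pulled in.
    sub fetch_rendered_sources {
        my ( $url, $seen ) = @_;
        $seen ||= {};
        return () if $seen->{$url}++;        # guard against refresh/frame loops

        my $res = $ua->get($url);
        return () unless $res->is_success;
        my $html = $res->content;

        my @targets;                         # URLs this page pulls in
        my $p = HTML::Parser->new(
            api_version => 3,
            start_h     => [
                sub {
                    my ( $tag, $attr ) = @_;
                    if ( $tag eq 'frame' or $tag eq 'iframe' ) {
                        push @targets, $attr->{src} if $attr->{src};
                    }
                    elsif ( $tag eq 'meta'
                        and lc( $attr->{'http-equiv'} || '' ) eq 'refresh'
                        and ( $attr->{content} || '' ) =~ /url\s*=\s*(\S+)/i )
                    {
                        push @targets, $1;
                    }
                },
                'tagname, attr'
            ],
        );
        $p->parse($html);
        $p->eof;

        # Resolve relative targets against the current URL and recurse.
        return ( $html,
            map { fetch_rendered_sources( URI->new_abs( $_, $url )->as_string, $seen ) }
              @targets );
    }

    my @sources = fetch_rendered_sources('http://www.foo.com');
    print scalar(@sources), " document(s) fetched\n";

That works for the cases I can enumerate, which is exactly my point: I would rather not keep enumerating them.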