jonjacobmoon has asked for the wisdom of the Perl Monks concerning the following question:
This is the easy part.
The hard part is that the pages the crawler visits may present the link in any of a variety of ways. It may be in a frame, it may be generated by JavaScript, it may sit behind a meta-refresh, or it may be rendered on the page in some unforeseen way that a browser knows how to handle. In short, I need my program to look at the final rendered HTML source just as a browser would.
To illustrate: if I have LWP go to http://www.foo.com and foo.com uses frames, then I need to check the source of each frame, not the source of the frameset page.
I have some ideas about how to do this by following links and adding exceptions for JavaScript, frames, meta-refresh, and any others I can come up with. But browsers already handle all of these cases, so if my program runs as if it were a browser, I don't have to add exceptions as they come up.
Does anyone have an easy way to do this that goes beyond, but may still include, HTML::Parser and LWP? I have researched this and know I can do it with an HTML::Parser and LWP combo where I follow certain links, but as I said, if the program can act like a browser, I don't need to worry about following links to get the source of the page that the user would eventually see.
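For reference, the "follow certain links" approach mentioned above might look roughly like the sketch below. It is only an illustration under stated assumptions: the helper name `embedded_targets` and the sample markup are hypothetical, it relies on the HTML::Parser and URI modules being installed, and it covers only frames, iframes, and meta-refresh. Links generated by JavaScript are not handled at all, which is exactly the gap the question is about.

```perl
use strict;
use warnings;
use HTML::Parser ();
use URI ();

# Hypothetical helper: return the absolute URLs a browser would load in
# place of (or inside) this page, i.e. frame/iframe sources and
# meta-refresh targets. JavaScript-generated links are NOT detected.
sub embedded_targets {
    my ($html, $base) = @_;
    my @targets;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($tag, $attr) = @_;    # tag and attribute names arrive lowercased
                if (($tag eq 'frame' || $tag eq 'iframe') && defined $attr->{src}) {
                    push @targets, URI->new_abs($attr->{src}, $base)->as_string;
                }
                elsif ($tag eq 'meta'
                    && lc($attr->{'http-equiv'} // '') eq 'refresh'
                    && ($attr->{content} // '') =~ /url\s*=\s*['"]?([^'";\s]+)/i)
                {
                    # content looks like "5; URL=/next.html"
                    push @targets, URI->new_abs($1, $base)->as_string;
                }
            },
            'tagname, attr',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @targets;
}

# Sample page (made up for illustration) with a meta-refresh and two frames.
my $sample = <<'HTML';
<html><head><meta http-equiv="Refresh" content="5; URL=/next.html"></head>
<frameset cols="20%,80%">
  <frame src="nav.html">
  <frame src="http://www.foo.com/main.html">
</frameset></html>
HTML

print "$_\n" for embedded_targets($sample, 'http://www.foo.com/');
```

In a crawler, each returned URL could then be fetched with `LWP::UserAgent->new->get($url)` and passed through the same helper again, with a list of already-seen URLs to avoid loops. That still leaves script-generated content unresolved.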
Replies are listed 'Best First'.

(cLive ;-) Re: Browser Emulation
by cLive ;-) (Prior) on Feb 02, 2002 at 23:18 UTC

Re (tilly) 1: Browser Emulation
by tilly (Archbishop) on Feb 03, 2002 at 03:15 UTC

Re: Browser Emulation
by trs80 (Priest) on Feb 02, 2002 at 21:24 UTC

Re: Browser Emulation
by gellyfish (Monsignor) on Feb 03, 2002 at 12:49 UTC
by jonjacobmoon (Pilgrim) on Feb 03, 2002 at 17:34 UTC

Re: Browser Emulation
by drifter (Scribe) on Feb 02, 2002 at 21:22 UTC

Re: Browser Emulation
by Cody Pendant (Prior) on Feb 02, 2002 at 22:34 UTC

Re: Browser Emulation
by Zaxo (Archbishop) on Feb 02, 2002 at 21:31 UTC
by theorbtwo (Prior) on Feb 02, 2002 at 21:42 UTC
by theorbtwo (Prior) on Feb 05, 2002 at 21:27 UTC