I have been asked to write a script that is a sort of limited crawler that acts like a browser. That is to say, the script will look for a specific link on a page and report whether it is present.

This is the easy part.

The hard part is that the pages the crawler visits may present the link in any of a variety of ways. It may be in a frame, it may be generated by JavaScript, it may sit behind a meta-refresh, or it may be rendered on the page in some unforeseen way that a browser knows how to handle. In short, I need my program to look at the final rendered HTML source just as a browser would.

To illustrate: if I have LWP go to http://www.foo.com and foo.com uses frames, then I need to check the source of each frame, not the source of the frameset page.

I have some ideas of how to do this by following links and adding exceptions for JavaScript, frames, meta-refresh, and any others I can come up with. But browsers already handle all of these cases, so if the script runs as if it were a browser, I don't have to add exceptions as they come up.
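To make the "follow links and add exceptions" idea concrete, here is a minimal sketch of it, written in Python purely for illustration (the same structure should map onto an HTML::Parser and LWP combination). It handles only frames, iframes, and meta-refresh; it does not execute JavaScript, which is exactly the gap a real browser would close. The names `LinkScanner`, `link_present`, and the `fetch` callback are hypothetical and not part of any existing module.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import re


class LinkScanner(HTMLParser):
    """Collects anchor hrefs plus the URLs a browser would load next."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()    # absolute <a href> targets found on this page
        self.follow = set()   # frame sources and meta-refresh targets

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.add(urljoin(self.base_url, attrs["href"]))
        elif tag in ("frame", "iframe") and attrs.get("src"):
            self.follow.add(urljoin(self.base_url, attrs["src"]))
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
            # content typically looks like "5; url=http://example.com/next"
            match = re.search(r"url\s*=\s*['\"]?([^'\";]+)",
                              attrs.get("content") or "", re.I)
            if match:
                self.follow.add(urljoin(self.base_url, match.group(1).strip()))


def link_present(start_url, target, fetch, max_pages=20):
    """Return True if `target` is linked from start_url or from any page
    reached through frames or meta-refresh. `fetch` is a caller-supplied
    function (an assumption of this sketch) mapping a URL to its HTML."""
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        scanner = LinkScanner(url)
        scanner.feed(fetch(url))
        if target in scanner.links:
            return True
        queue.extend(scanner.follow - seen)
    return False
```

Passing the fetcher in as a function keeps the traversal logic separate from the HTTP layer, so it can be exercised against canned pages before being pointed at a live site. The `max_pages` cap is an arbitrary guard against refresh loops.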

Does anyone have an easy way to do this that goes beyond, but may still include, HTML::Parser and LWP? I have researched this and know I can do it with an HTML::Parser and LWP combo where I follow certain links, but as I said, if the script can act like a browser, I don't need to worry about following links to get the source of the page that the user would eventually see.


I admit it, I am Paco.

In reply to Browser Emulation by jonjacobmoon
