In short, you can't get Perl to pretend to be a 'real' web browser, i.e. IE/Mozilla/NS/Opera. You can fake all of the behaviour except for the JavaScript/DOM redirection part. You can fake some JavaScript support, but to do it all you need the DOM.

Here is a list of some of the issues you will need to deal with to get the 'real' pages.

  1. Use LWP::UserAgent to get the pages. In vanilla form it works for > 90% of pages.
  2. Add a random agent string so LWP pretends to be IE 5/5.5/6. The easiest way to get them is to grep your Apache access logs. There are also plenty of lists on the net.
  3. Add support for meta-refresh redirects (there are about 6 different 'valid' syntaxes, where valid means that browsers accept them).
  4. Add frames support (vital).
  5. Add cookie support, as this is often tested for.
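
As a rough illustration of steps 1 to 5, here is a minimal sketch, not the code from our project. The agent strings are hypothetical samples, the helper names (build_ua, meta_refresh_url, frame_srcs, fetch_following_refresh) are made up for this example, and the regexes cover only a few of the meta-refresh and frame syntaxes you will meet in the wild.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use URI;

# Hypothetical sample agent strings; harvest real ones from your access logs.
my @AGENTS = (
    'Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)',
    'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
);

# Steps 1, 2 and 5: a user agent with a random IE-style string and cookies.
sub build_ua {
    return LWP::UserAgent->new(
        agent      => $AGENTS[ rand @AGENTS ],
        cookie_jar => HTTP::Cookies->new,    # in-memory cookie jar
        timeout    => 30,
    );
}

# Step 3: pull the target URL out of a meta refresh tag, if there is one.
# Returns undef when the page has no refresh or the refresh carries no URL.
sub meta_refresh_url {
    my ($html) = @_;
    while ( $html =~ /<meta\b([^>]*)>/gi ) {
        my $attrs = $1;
        next unless $attrs =~ /http-equiv\s*=\s*["']?refresh\b/i;
        next unless $attrs =~ /content\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/i;
        my $content = $1 // $2 // $3;
        return $1
            if $content =~ /^\s*[\d.]*\s*[;,]\s*(?:url\s*=\s*)?["']?([^"'\s]+)/i;
    }
    return;
}

# Step 4: list the src of every frame and iframe so they can be fetched too.
sub frame_srcs {
    my ($html) = @_;
    my @srcs;
    while ( $html =~ /<i?frame\b[^>]*?\bsrc\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/gi ) {
        push @srcs, $1 // $2 // $3;
    }
    return @srcs;
}

# Fetch a URL and keep following meta refreshes for a limited number of hops.
# LWP already follows ordinary HTTP redirects on its own.
sub fetch_following_refresh {
    my ( $ua, $url, $max_hops ) = @_;
    $max_hops //= 5;
    my $res;
    for ( 0 .. $max_hops ) {
        $res = $ua->get($url);
        last unless $res->is_success;
        my $next = meta_refresh_url( $res->decoded_content // '' ) or last;
        $url = URI->new_abs( $next, $res->base )->as_string;
    }
    return $res;
}
```

Typical use would be something like my $res = fetch_following_refresh( build_ua(), $url ); followed by a loop over frame_srcs( $res->decoded_content ), resolving each src against $res->base before fetching it.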

Once you have done all that, the only 'rejects/cloaking' you will get will involve JavaScript redirects. There are numerous variations: window.location = blah, window.location(blah), location.href = blah, location.replace(blah), etc.

Some of these you can parse and follow. Some you can't, as they concatenate bits of the DOM into the redirect string.
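
To make the 'parse and follow' case concrete, here is a hedged sketch that only follows a redirect when the target is a plain string literal and gives up as soon as the string is concatenated with something else. The function name js_redirect_target and the list of location spellings it recognises are assumptions for this example, not an exhaustive catalogue.

```perl
use strict;
use warnings;

# Spellings of "the location object" this sketch recognises (an assumed,
# incomplete list): location, window.location, document.location, and so on,
# optionally followed by .href, .replace or .assign.
my $LOC = qr/(?:(?:window|document|self|top|parent)\s*\.\s*)?location(?:\s*\.\s*(?:href|replace|assign))?/;

# Return the redirect target if it is a simple string literal.
# Return undef if no redirect is found or the target is built by
# concatenation (for example with pieces of the DOM).
sub js_redirect_target {
    my ($js) = @_;
    if ( $js =~ /\b$LOC\s*(?:=|\()\s*(["'])((?:(?!\1).)*)\1\s*(\+)?/ ) {
        return if $3;    # literal is only one part of a larger expression
        return $2;
    }
    return;
}
```

A redirect such as window.location = "/p?ref=" + document.referrer deliberately yields undef here, since guessing the rest of the string without a DOM is exactly the part that cannot be faked reliably.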

When it comes to parsing the HTML, HTML::Parser will cough up the JavaScript either in the comments or in the text (depending on how it is wrapped), so it is suboptimal. If you are only interested in popups, you are basically looking for window.open and a few other strings. You can parse these out reasonably reliably with regexes.
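
For the popup-only case, a regex over the raw page source sidesteps the question of whether the script sits in a comment or in text. The sketch below is one possible way to do it; find_popups is a hypothetical helper name, and it only looks for window.open, not the 'few other strings' mentioned above.

```perl
use strict;
use warnings;

# Scan raw HTML (script blocks, HTML comments and inline event handlers
# alike) for window.open calls. Returns one entry per call: the first
# argument if it starts with a string literal, or a placeholder when the
# URL is computed at run time.
sub find_popups {
    my ($html) = @_;
    my @popups;
    while ( $html =~ /\bwindow\s*\.\s*open\s*\(\s*(?:(["'])((?:(?!\1).)*)\1)?/g ) {
        push @popups, defined $2 ? $2 : '(computed URL)';
    }
    return @popups;
}
```

A page counts as 'has popups' if the returned list is non-empty. Note that a literal followed by a concatenation is reported as just its literal part.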

We implemented all of the above on a current project, but eventually ended up hacking IE so that it is a headless, windowless slave that goes and does our bidding. The nice part of that solution is that it really is IE doing the fetching, so no one can tell it isn't IE. IE parses the HTML, builds the DOM, runs the JavaScript, etc. We just gather up the HTML data from the parent and any child windows. You can hack Mozilla in a similar fashion.

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print


In reply to Re: PopUp Detection by tachyon
in thread PopUp Detection by BMaximus
