cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day Monks. I am trying to get stories from Reuters that come over its RSS feed. Some of these stories are multi-part, so to get the whole thing it's necessary to follow the "Next" link at the bottom. Alas, this is a #!&%* javascript link, which WWW::Mechanize can't follow.

So I'm trying to do it with Win32::IE::Mechanize, which can supposedly follow those links. When I point it at, for example, perlmonks.com, it works fine, but when I point it at one of the URLs from the Reuters feed it doesn't:

use strict;
use Win32::IE::Mechanize;

my $iemech = Win32::IE::Mechanize->new( visible => 1 );
$iemech->get('http://feeds.reuters.com/~r/reuters/topNews/~3/84952673/newsarticle.aspx');
my $html = $iemech->content;
print $html;
produces this HTML:
<HTML><HEAD>
<LINK href="http://i.today.reuters.com/media/styles/rcom-article.css" type=text/css rel=stylesheet>
<LINK href="http://i.today.reuters.com/media/styles/rcom-master.css" type=text/css rel=stylesheet>
<SCRIPT language=javascript src="http://i.today.reuters.com/News/script/links.js" type=text/javascript></SCRIPT>
</HEAD></HTML>
which ain't anywhere close to the HTML for what's actually showing in the IE window.

One thing I notice is that there's a redirect happening. But unlike WWW::Mechanize, Win32::IE::Mechanize seems not to store the content in its object but (I guess) gets it from the browser DOM. So it seems like the content method should return whatever is showing in the browser. But as you'll see if you try the code, it doesn't.
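
In case it's a timing problem with the redirect, one thing I've been meaning to try is driving IE directly through Win32::OLE (which is what Win32::IE::Mechanize wraps underneath, as far as I can tell) and polling until the browser says it's done before reading the live DOM. A rough, untested sketch:

use strict;
use warnings;
use Win32::OLE;

# Drive IE by hand so we can wait out the redirect ourselves.
my $ie = Win32::OLE->new('InternetExplorer.Application')
    or die "Can't start IE: ", Win32::OLE->LastError;
$ie->{Visible} = 1;
$ie->Navigate('http://feeds.reuters.com/~r/reuters/topNews/~3/84952673/newsarticle.aspx');

# Busy stays true while navigation is in flight;
# ReadyState 4 is READYSTATE_COMPLETE.
sleep 1 while $ie->{Busy} or $ie->{ReadyState} != 4;

# Read the live DOM rather than the original source.
print $ie->{Document}{documentElement}{outerHTML};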

Anyone know if there's a fix for this?

TIA...

Steve

Replies are listed 'Best First'.
Re: Win32::IE::Mechanize not getting correct content
by un-chomp (Scribe) on Mar 15, 2007 at 14:49 UTC
    I'd highly recommend using Win32::IEAutomation as an alternative - I haven't looked back since switching. This should give you the full HTML for the page, but note that it doesn't exactly match what you see when you 'view source' - it's more like IE's internal representation (with funny business like uppercase tags).
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Win32::IEAutomation;

    my $ie = Win32::IEAutomation->new( visible => 1, maximize => 1 );
    my $url = 'http://feeds.reuters.com/~r/reuters/topNews/~3/84952673/newsarticle.aspx';
    $ie->gotoURL( $url );
    print $ie->Content;
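
    If you then need to follow the multi-part "Next" links, something along these lines should work - untested, and it assumes the link's visible text is literally "Next" and that I'm remembering the method names right:

    # Hypothetical follow-up: click the javascript "Next" link and
    # re-read the page content once the navigation finishes.
    $ie->getLink( 'linktext:', 'Next' )->Click;
    $ie->WaitforDone;      # wait for the new page to finish loading
    print $ie->Content;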
Re: Win32::IE::Mechanize not getting correct content
by scorpio17 (Canon) on Mar 15, 2007 at 21:02 UTC

    I've recently been playing around with AJAX (hasn't everyone?) and something annoying that I have noticed is the following:

    If you load an HTML page into IE that contains a div tag, like this:

    <div id="ajaxstuff" >Hello World!</div>

    and a javascript function that somehow changes this div's content, like this:

    document.getElementById("ajaxstuff").innerHTML = 'Howdy!';

    maybe this javascript is inside a function that gets run when the page loads:

    <body onLoad="init();">

    or maybe it gets run as the result of an event, like clicking a button somewhere on the page - it doesn't seem to make a difference...

    If you select 'view->source' from the IE browser menu anytime after this function has been run, you will see the original "Hello World!" message, not the updated "Howdy!" message - even though "Howdy!" is being displayed in the current browser window!

    This doesn't really answer your question (sorry!) - but I think it's an important clue to what you're seeing: if a page is somehow modified by javascript after the initial page load, then view->source will not reflect the update, but will instead show the original HTML. So any script attempting to mechanize this will be hitting the proverbial moving target.

    You might be able to 'reverse engineer' the real (updated) HTML by looking for div tags, javascript functions, and possibly URLs invoked behind the scenes (if you point your browser to these, you'll get back the raw response data used to update the page).

    Of course, it may be encrypted or obfuscated, etc., but it's worth a try.
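
    For instance, once you spot the URL that the page's javascript fetches behind the scenes, you can request it directly with LWP and inspect the raw response. The endpoint below is made up - substitute whatever you actually find in the page's scripts:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new( agent => 'Mozilla/5.0' );

    # Hypothetical URL - use the one the page's javascript really requests.
    my $resp = $ua->get('http://i.today.reuters.com/News/some_endpoint');
    die $resp->status_line unless $resp->is_success;
    print $resp->decoded_content;   # the raw data used to update the page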

    BTW - Firefox seems to do the same thing. :-(

    Good luck!

      If you want to see the effect of script changes on the DOM (that is, the current document structure instead of the original HTML source) in Firefox, take a look at Firebug. It shows live updates to the DOM and CSS (and lets you change them), has a script-accessible javascript log (with stack traces), can evaluate user-entered javascript in the current page's context, logs XMLHTTP requests, and more. It really is the most useful tool I know of for javascript (and HTML) debugging.

Re: Win32::IE::Mechanize not getting correct content
by jhourcle (Prior) on Mar 15, 2007 at 16:37 UTC

    I don't know what you're trying to do with this task, and I know it doesn't help you solve the problem you're having ...

    ... but are you aware that there's a robots.txt on the site, specifically requesting that you not automate access to its contents?

    (yes, I know there are many differing interpretations of which types of programs should honor robots.txt -- for instance, a process you start manually that just grabs the pages to present them as one page for you is likely different from retrieving pages for pre-caching or for a search-engine spider)
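
    If you want to check a site's robots.txt for yourself, WWW::RobotRules makes it easy (a quick sketch):

    use strict;
    use warnings;
    use WWW::RobotRules;
    use LWP::Simple qw(get);

    my $robots_url = 'http://feeds.reuters.com/robots.txt';
    my $rules      = WWW::RobotRules->new('MyBot/1.0');    # your agent name here

    my $txt = get($robots_url);
    $rules->parse($robots_url, $txt) if defined $txt;

    my $article = 'http://feeds.reuters.com/~r/reuters/topNews/~3/84952673/newsarticle.aspx';
    print $rules->allowed($article) ? "allowed\n" : "disallowed\n";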

    Update: ikegami is right -- my browser put its insertion bar at the end, which I mistook for a /. (bah ... time to get my eyes checked again). They specifically allow robot access. Feel free to downvote my oversight.

      Did the robots.txt change? Because it currently doesn't disallow anything. (Disallow: / would.)
      Yes, I'm aware. The robots.txt file appears to be targeted at feedburner.com. And you're right, this doesn't help me solve my problem.