in reply to Win32::IE::Mechanize completed?

You're actually waiting for MSIE to finish building the page. For Google, I see no problem, but for a page that depends partly on Javascript to complete the page (using document.write(), for example), you have to wait.

What I've done until now is load the page twice and then wait a second. Not great, but it worked rather well. But you just gave me a new hint.

So I tried printing out $ie->{agent}->ReadyState in a loop, with just a little sleep after using

use Time::HiRes 'sleep';

It turns out that on a page depending on JavaScript, ReadyState returns 3 for a little while, and then it jumps to 4. That would seem like a pretty reliable way to tell whether the page is actually finished.
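For anyone who wants to repeat that experiment, here is a minimal sketch of such a tracing loop. The name trace_states and its parameters are made up for illustration and are not part of Win32::IE::Mechanize; the probe is passed in as a code ref, so the loop itself does not need IE to run.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper: sample $probe (a code ref returning the current
# state) every $interval seconds until it returns $final or $timeout
# seconds have passed. Returns a list of [elapsed_seconds, state] pairs,
# one for each time the state changed.
sub trace_states {
    my ($probe, $final, $timeout, $interval) = @_;
    my $start = time();
    my (@changes, $last);
    while (1) {
        my $state = $probe->();
        if (!defined $last || $state != $last) {
            push @changes, [time() - $start, $state];
            $last = $state;
        }
        last if $state == $final || time() - $start >= $timeout;
        sleep $interval;
    }
    return @changes;
}

# Usage with the $ie object from this thread (assumed, not run here):
# printf "%.3fs ReadyState=%d\n", @$_
#     for trace_states(sub { $ie->{agent}->ReadyState }, 4, 10, 0.055);
```

The timeout is only there so the loop cannot hang forever if the state never reaches the final value.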

Checking the source for _wait_while_busy() in Win32::IE::Mechanize (0.008), I spotted the comment:

# The documentation isn't clear on this.
# The DocumentComplete event roughly says:
# the event gets fired (for each frame) after ReadyState == 4

That points in the same direction. Perhaps access to ReadyState should be more formalized, but for now, the next snippet seems to work well for me:

my $url = '...'; # you choose
$ie->get($url);
use Time::HiRes 'sleep';
while ($ie->{agent}->ReadyState < 4) {
    sleep 0.055;
}
$\ = "\n";
print $_->url foreach $ie->links;

Note that I picked 55ms for the sleep time, because that appears to roughly be the resolution of the timer in Windows. It also looks like a good compromise to me, not too fast, nor too slow.

Re^2: Win32::IE::Mechanize completed?
by 2ge (Scribe) on Apr 13, 2005 at 08:42 UTC
    Hello bart,

    Thanks for the nice reply. I tried this before and played with ReadyState, but for me it sometimes jumps to 4 and sometimes stays at 3 (even when the IE status bar says "Done"). Does it work for you on any page? Try huge pages, for example http://www.albinoblacksheep.com/; on sites that use Flash it doesn't work. Also, when I go through an open proxy I often get state 3, even on "easy" pages (XHTML + JS). So I can't rely on this alone to determine whether the page is complete. But it is better than nothing: now we can specify a timeout, extract the links once it expires, and reload if no links were found :). That is always better than defining a static sleep time. OK, thanks.
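The timeout-then-reload idea could look roughly like the sketch below. It is only an illustration under assumptions: wait_until and links_with_retry are hypothetical names, not part of Win32::IE::Mechanize, and the only calls taken from this thread are $ie->get, $ie->links and $ie->{agent}->ReadyState.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper: poll $is_ready (a code ref) until it returns true
# or $timeout seconds have passed. Returns 1 on success, 0 on timeout.
sub wait_until {
    my ($is_ready, $timeout, $interval) = @_;
    $interval = 0.055 unless defined $interval;
    my $deadline = time() + $timeout;
    until ($is_ready->()) {
        return 0 if time() >= $deadline;
        sleep $interval;
    }
    return 1;
}

# Hypothetical helper: load $url up to $max_tries times. After each load,
# wait for ReadyState 4 or the timeout, then accept the page as soon as
# it yields at least one link. Returns the links, or an empty list.
sub links_with_retry {
    my ($ie, $url, $timeout, $max_tries) = @_;
    for my $try (1 .. $max_tries) {
        $ie->get($url);
        wait_until(sub { $ie->{agent}->ReadyState >= 4 }, $timeout);
        my @links = $ie->links;
        return @links if @links;
    }
    return;
}

# Usage (assumed, not run here):
# print $_->url, "\n" for links_with_retry($ie, $url, 5, 3);
```

Whether a page with no links should count as a failed load depends on the site, so the retry condition is a judgment call rather than a general rule.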