bliako has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks,

I have the following script which attempts to download 3 urls but gets stuck in $mech->get(url) on the second one. The last trace is 'method' => 'Network.loadingFinished'.

Does anyone have any debugging tips I can use to find why it gets stuck there? Using other urls seems to be working OK.

#!/usr/bin/env perl

use strict;
use warnings;

use Log::Log4perl qw(:easy);
use WWW::Mechanize::Chrome;

my @urls = (
    'https://zoom.earth/#34.957995,32.299805,5z,sat,am,2018-07-20',
    'https://zoom.earth/#34.957995,32.299805,5z,sat,am,2018-07-21',
    'https://zoom.earth/#34.957995,32.299805,5z,sat,am,2018-07-22',
);

Log::Log4perl->easy_init($TRACE);

print "$0 : starting headless chrome ...\n";
my $mech = WWW::Mechanize::Chrome->new(
    headless => 1,
    launch_arg => [
        '--password-store=basic',
        '--remote-debugging-port=9223',
        '--enable-logging',
        '--disable-gpu',
        '--no-sandbox',
        '--ignore-certificate-errors',
        '--disable-background-networking',
        '--disable-client-side-phishing-detection',
        '--disable-component-update',
        '--disable-hang-monitor',
        '--disable-save-password-bubble',
        '--disable-default-apps',
        '--disable-infobars',
        '--disable-popup-blocking',
        '--disable-default-apps',
    ],
);
if( ! defined($mech) ){
    print STDERR "$0 : call to ".'WWW::Mechanize::Chrome->new()'." has failed.\n";
    exit(1)
}
print "$0 : done, headless chrome is now running.\n";

$mech->add_header('User-agent' => 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.39 Safari/537.36');

my $idx = 1;
foreach my $aurl (@urls){
    my $outfile = "out.$idx.png";
    print "$0 : about to get '$aurl'\n";
    get_and_shot($mech, $aurl, $outfile, 4) or die "get_and_shot() : url '$aurl'";
    print "$0 : done got '$aurl'\n";
    $idx++;
}

# returns 0 on failure, 1 on success
sub get_and_shot {
    my $mech = $_[0];
    my $aurl = $_[1];
    my $outfile = $_[2];
    my $sleeptime = $_[3] || 2;

    print 'get_and_shot()'." : entered for url '$aurl'\n";
    if( ! defined($mech) ){ print "$0 : mock done\n"; return 1 }

    print 'get_and_shot()'." : getting url '$aurl'\n";
    if( ! $mech->get($aurl) ){
        print STDERR "get_and_shot() : call to ".'get()'." has failed for url '$aurl'.\n";
        return 0
    }
    print 'get_and_shot()'." : got OK url '$aurl'.\n";

    my $page_png = $mech->content_as_png();
    my $fh;
    if( ! open($fh, '>', $outfile) ){
        print STDERR "get_and_shot() : could not save url '$aurl' to output file '$outfile', $!\n";
        return 0
    }
    binmode $fh, ':raw';
    print $fh $page_png;
    close $fh;

    print 'get_and_shot()'." : saved OK '$aurl' to '$outfile', now sleeping for $sleeptime seconds ...\n";
    sleep($sleeptime);
    print 'get_and_shot()'." : done, woken up now and exiting sub.\n";
    return 1 # success
};

As I said, it works fine with other urls until it encounters the 2nd or 3rd url from zoom.earth. For example, setting @urls to:

my @urls = (
    'http://www.ibm.com',
    'http://www.ibm.com',
    'http://www.ibm.com',
    'https://zoom.earth/#34.957995,32.299805,5z,sat,am,2018-07-20',
    'https://zoom.earth/#34.957995,32.299805,5z,sat,am,2018-07-21',
    'https://zoom.earth/#34.957995,32.299805,5z,sat,am,2018-07-22',
);

will make it stop after all the ibm's have been fetched and screenshotted.

Update 1: even the ibm's sometimes don't work - mech gets stuck on them too sometimes ...

Any ideas?

Re: WWW::Mechanize::Chrome : gets stuck sometimes
by Corion (Patriarch) on Aug 01, 2018 at 18:17 UTC

    Ah hah - it seems that loading the same URL twice (as the zoom.earth URLs are, except for the anchor) makes Chrome issue a Page.navigatedWithinDocument event instead of one of the other events. I'll have to find out how to handle those in a sane fashion, as all the network traffic will then come from JavaScript afterwards ...

    Update: A quick workaround is to navigate to a different page before fetching the next page:

    ...
    $mech->get('about:blank');
    $mech->get($aurl);
    ...

    I've managed to reproduce the situation outside this setup - any two URLs that are identical except for the fragment will trigger this problem.
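
    The workaround can be applied only when it is needed, by checking whether the next URL differs from the previous one in the fragment alone. The following is a minimal sketch, not part of WWW::Mechanize::Chrome: the helper names without_fragment(), same_document() and get_fresh() are hypothetical, and the only method assumed on $mech is get(), as used in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the URL with any '#fragment' removed.
sub without_fragment {
    my ($url) = @_;
    (my $base = $url) =~ s/#.*\z//s;
    return $base;
}

# Hypothetical helper: true when both URLs point at the same document,
# i.e. they are identical once their fragments are ignored.
sub same_document {
    my ($prev, $next) = @_;
    return 0 unless defined $prev && defined $next;
    return without_fragment($prev) eq without_fragment($next) ? 1 : 0;
}

# Hypothetical wrapper: visit about:blank first when the previous URL
# and the next URL differ only in their fragment, then fetch the page.
# $mech is any object with a get() method.
sub get_fresh {
    my ($mech, $prev_url, $url) = @_;
    $mech->get('about:blank') if same_document($prev_url, $url);
    return $mech->get($url);
}
```

    In the loop of the original script this would mean remembering the previous URL and calling get_fresh($mech, $prev, $aurl) in place of $mech->get($aurl).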

      Excellent catch. I did indeed see a Page.navigatedWithinDocument error being reported but forgot to mention it ... the heat.

      WWW::Mechanize::Chrome is at version 0.17, google-chrome is Version 68.0.3440.75 (Official Build) (64-bit) and it is run under linux fedora 27 latest kernels and all, perl is at v5.26.2.

      Unrelated: what is the proper way to shutdown the client and make sure that no "chrome" process is still running? And how do I reset cookies? Is there a WWW::Mechanize::Chrome forum where I can post these questions without pestering you here or shall I post another question?

      thanks, bliako

        https://perlmonks.org is the correct "forum" to use for support and bug reports. This should also be noted somewhere in the documentation.

        The interface for cookies is fairly rudimentary so far, but if you use a private Chrome instance with its own profile, you'll get a fresh/empty cookie jar every time.
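
        One way to get such a private instance is to point Chrome at a new, empty profile directory on every run. This is a sketch under assumptions: fresh_profile_args() is a hypothetical helper, --user-data-dir is a Chrome command-line switch rather than a WWW::Mechanize::Chrome option, and whether it interacts with any profile handling the module does itself has not been checked here.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: create an empty temporary directory and return
# the Chrome switch that makes it the profile (user data) directory.
# CLEANUP => 1 removes the directory when the Perl program exits.
sub fresh_profile_args {
    my $dir = tempdir('chrome-profile-XXXXXX', TMPDIR => 1, CLEANUP => 1);
    return ("--user-data-dir=$dir");
}

# Assumed usage, reusing the launch_arg option from the original script:
#
#   my $mech = WWW::Mechanize::Chrome->new(
#       headless   => 1,
#       launch_arg => [ fresh_profile_args(), '--disable-gpu' ],
#   );
```

        Each call creates a different directory, so two instances started this way should not share cookies.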

        The proper way to shut down your Chrome instance is to let the WWW::Mechanize::Chrome object go out of scope. That should clean up the Chrome process associated with it. But maintaining that is a constant battle against circular references.

Re: WWW::Mechanize::Chrome : gets stuck sometimes
by Corion (Patriarch) on Aug 01, 2018 at 13:44 UTC

    Most likely, Chrome produces some sequence of events while loading a page that WWW::Mechanize::Chrome does not cater for.

    The interesting events usually start with Page.frameScheduledNavigation, Page.frameStartedLoading or Network.requestWillBeSent. Maybe some resource is now retrieved from cache and some event is not fired when the module expects it to be fired.

    Without seeing the sequence of events, this is hard to track down. I'll try your script later and see if I can reproduce it.

    What version of WWW::Mechanize::Chrome and what version of Chrome are you using?