Retrieve Rendered Web Page Using WWW::Mechanize::Chrome

roho has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to get a rendered web page returned using WWW::Mechanize::Chrome (referencing www.google.com for testing in the code below). I am using the "text" format of the content method, but all I get is raw HTML. I have scoured the module documentation but cannot find the option(s) to return rendered text. I am running Strawberry Perl 5.24.1 on Windows 10. Any ideas what I am missing here, or is it not currently possible to retrieve rendered text from a web page using this module? TIA for any suggestions.

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize::Chrome;
use Log::Log4perl qw(:easy);
Log::Log4perl->easy_init($ERROR);

my $mech = WWW::Mechanize::Chrome->new(launch => 'C:\Program Files (x86)\Google\Chrome\Application\chrome.exe');
$mech->get('https://www.google.com');
print $mech->content( format => 'text' );

"It's not how hard you work, it's how much you get done."


Re: Retrieve Rendered Web Page Using WWW::Mechanize::Chrome by Corion (Patriarch) on Nov 03, 2020 at 06:53 UTC
This shouldn't happen. Does using `$mech->text()` work for you? What version of WWW::Mechanize::Chrome are you using?
Re^2: Retrieve Rendered Web Page Using WWW::Mechanize::Chrome by roho (Bishop) on Nov 03, 2020 at 14:28 UTC
I get the same result using $mech->text(). I am using version 0.58 of WWW::Mechanize::Chrome.
Re^3: Retrieve Rendered Web Page Using WWW::Mechanize::Chrome by Corion (Patriarch) on Nov 03, 2020 at 14:45 UTC
Indeed, this is weird. When I run your code, I get the Google consent modal and an error:

2 elements found for //body at pb11123357.pl line 13.

... most likely because there are two body elements, one in the consent modal iframe and one in the main page. Once I click that away, I get the script content of the page, because the contents of `script` tags are also included in the `textContent` attribute. This is somewhat inconvenient, and I see no easy workaround for this.

The fix seems to be to use the `innerText` attribute instead of `textContent`. Making this change makes the page "work" in the sense that the `<script>` content is not printed anymore.

As a workaround you can monkey-patch the code until the next release:

use WWW::Mechanize::Chrome;
{
    no warnings 'redefine';
    sub WWW::Mechanize::Chrome::text {
        my $self = shift;

        # Waugh - this is highly inefficient but conveniently short to write
        # Maybe this should skip SCRIPT nodes...
        join '', map { $_->get_attribute('innerText') } $self->xpath('//body', single => 1 );
    }
}

Thanks for reporting this! If this works for you as well, I'll write a test for this and release the fix soonish.
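
To make the `textContent` versus `innerText` difference concrete, here is a minimal pure-Perl sketch that needs no browser. The tree layout and the helper names `text_content` and `inner_text` are hypothetical, invented only for this illustration; they are not part of WWW::Mechanize::Chrome, and the real `innerText` in a browser also takes CSS visibility and line breaks into account, which this toy model ignores.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy DOM (an assumption for illustration): a node is either a plain
# string (a text node) or a hash with a tag name and a child list.
my $body = {
    tag      => 'body',
    children => [
        { tag => 'h1',     children => ['Hello'] },
        { tag => 'script', children => ['var x = 1;'] },
        { tag => 'p',      children => [' world'] },
    ],
};

# Behaves like textContent: concatenates every descendant text node,
# including the text inside <script>.
sub text_content {
    my ($node) = @_;
    return $node unless ref $node;
    return join '', map { text_content($_) } @{ $node->{children} };
}

# Roughly like innerText: skips subtrees that are never rendered.
sub inner_text {
    my ($node) = @_;
    return $node unless ref $node;
    return '' if $node->{tag} =~ /\A(?:script|style)\z/i;
    return join '', map { inner_text($_) } @{ $node->{children} };
}

print text_content($body), "\n";   # prints "Hellovar x = 1; world"
print inner_text($body),   "\n";   # prints "Hello world"
```

The first line of output contains the script source, which matches the symptom described above; the second contains only the text a reader would see.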
Re^4: Retrieve Rendered Web Page Using WWW::Mechanize::Chrome by roho (Bishop) on Nov 03, 2020 at 15:03 UTC