Scraping Rendered Text that is not in Source Code

bobross419 has asked for the wisdom of the Perl Monks concerning the following question:

I've been fighting with this one all day. I've gone through quite a few threads here, but none of them seem to be helping. I'm trying to figure out a way to get text information that is rendered on the screen, but is absent in the source code.

I've come across numerous posts that say you should use this module, or that module, but quite frankly the documentation on some of these modules is too lackluster for a novice to follow.

I'm currently working with WWW::Mechanize::Firefox (which was suggested quite a few times), but it only seems to be able to return the basic source code of the page and not what is actually rendering on the screen.

I've also tried using WWW::Scripter with the Javascript plugin without success.

You can find most of the Perl Monks threads I've been through by checking the last post in this thread: http://www.perlmonks.org/?node_id=821773

I also attempted to use some ATT proxy thing that is supposed to let you see all the data passed, but it did nothing at all that I could see.

At one point I attempted to install the Firefox screen render plugin, but it appears that this is no longer available. However, I did find the View Source Chart add-on and it does include the rendered text in the source chart. I have no way of getting the data from that source chart over to perl though.

Does anyone have a way to do this other than what has already been suggested? At the very least, if someone could point me to some worthwhile documentation? I've read through everything on the CPAN site regarding WWW::Mechanize::Firefox (FAQ, Troubleshooting, Examples, etc) but nothing seems to indicate how to actually pull this information. The examples for Javascript don't seem to work at all and just throw compilation errors.

I've only been using Perl for a few weeks, but I love the language. I just need a push in the right direction for doing this.

Here is the page that I'm using as an example: http://www.acehardware.com/mystore/storeDetail.jsp?store=14671

I'm trying to see if I can get the address information. I know the span/div IDs for the items I want, it just won't come through for me.

Here is where I'm currently at with WWW::Mechanize::Firefox.

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize::Firefox;
my $mech = WWW::Mechanize::Firefox->new(autoclose => 0);
$mech->allow(javascript => 1);

#$mech->get('http://www.acehardware.com/mystore/storeDetail.jsp?store=14671', ':content_file' => 'webpage.txt');
$mech->get('http://www.acehardware.com/mystore/storeDetail.jsp?store=14671');

my $retries = 10;
    while ($retries-- and ! $mech->is_visible( xpath => '//*[@id="city"]' )) {
          sleep 1;
    };
    die "Timeout" unless $retries;
    
    # Now the element exists
    #$mech->click({xpath => '//*[@id="submit"]'});

print "..." . $mech->xpath('//dataAddress2[id="city"]', one => 1);
open(FO,">test.txt") or die "unable";
print FO $mech->content;
print "DONE!";

This gives me the following error:

No elements found for '//dataAddress2[id="city"]' at script2.pl line 2.

Thanks in advance from an aspiring perlophyte.


Replies are listed 'Best First'.
Re: Scraping Rendered Text that is not in Source Code by Corion (Patriarch) on Oct 31, 2010 at 09:23 UTC
Your script finds the element you seem to be looking for once I fix the bad XPath query in line 21: `//dataAddress2[id="city"]` would be searching for an HTML tag `dataAddress2`, which does not exist on that page (nor anywhere else). As you are searching for an element with an `id` attribute anyway, and `id` attributes are (supposed to be) unique across the page, using the following XPath expression extracts the element for me (provided I've unblocked the crappy Javascript on all those pages in NoScript): `//*[@id="city"]` For finding what elements I've captured, I like to print `->{innerHTML}`: `print "..." . $mech->xpath('//*[@id="city"]', one => 1)->{innerHTML};` It seems that the Javascript gets triggered after some time without another event and the element just gets filled in instead of actually appearing, so you might need to wait in a loop to watch the element content change from `&nbsp;` to the content you actually want.
Re^2: Scraping Rendered Text that is not in Source Code by bobross419 (Acolyte) on Oct 31, 2010 at 19:58 UTC
Thanks, Corion, the `->{innerHTML}` is the part that's been missing for me. In one of many previous attempts, I did have the correct XPath syntax, but because I didn't know about `->{innerHTML}` it just printed a hash reference. I am now off and running on the right path. I solved the wait problem by modifying the while loop to check whether the content of the "city" element is still `&nbsp;`. Again, thank you so much, and thanks to the rest of the guys/gals that offered some information to help me along the way.
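For anyone landing here later, the wait loop described above can be sketched roughly as follows. This is only a sketch under assumptions: `wait_for_text` is a hypothetical helper, not part of WWW::Mechanize::Firefox, and the stubbed probe with the placeholder value 'Springfield' stands in for a live browser so the loop can be tried without Firefox.

```perl
use strict;
use warnings;

# Hypothetical helper (not a WWW::Mechanize::Firefox method): call $probe
# up to $tries times, pausing $delay seconds between attempts, and return
# the first value that is neither empty nor a lone non-breaking space.
sub wait_for_text {
    my ($probe, $tries, $delay) = @_;
    for my $attempt (1 .. $tries) {
        my $text = $probe->();
        if (defined $text) {
            $text =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
            return $text
                if length $text and $text ne '&nbsp;' and $text ne "\xA0";
        }
        sleep $delay if $delay and $attempt < $tries;
    }
    return;    # gave up: the element never got real content
}

# With a live browser, the probe would be the expression from this thread:
#   my $city = wait_for_text(
#       sub { $mech->xpath('//*[@id="city"]', one => 1)->{innerHTML} },
#       10, 1,
#   );

# Stubbed probe: two placeholder reads, then a made-up city name.
my @responses = ('&nbsp;', '&nbsp;', 'Springfield');
my $city = wait_for_text(sub { shift @responses }, 10, 0);
print defined $city ? "City: $city\n" : "Timed out\n";
```

Returning nothing on timeout (instead of testing the retry counter after the loop) keeps the success and timeout cases from being confused with each other.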
Re: Scraping Rendered Text that is not in Source Code by kcott (Archbishop) on Oct 31, 2010 at 05:36 UTC
If you're having problems with the xpaths themselves, the W3C XPath documentation has many examples of xpath syntax and abbreviations. Another module that seems popular for this type of work is HTML::TreeBuilder::XPath. Also, in your code you have `while (...) {...};` - there shouldn't be a semicolon at the end. I also notice the error is reported as `...at script2.pl line 2.` but line 2 of the script you posted is `use strict;`. -- Ken
Re^2: Scraping Rendered Text that is not in Source Code by bobross419 (Acolyte) on Oct 31, 2010 at 06:09 UTC
Thanks for the reply. I've never done anything with XPaths before, but I did look into the documentation a little. At this point I was going on hour 7 of what I thought would be an easy script... The while loop was copied straight off of a CPAN example somewhere, but I'll definitely keep that in mind for the future. About the error, I really don't know. I went through so many error messages today that they all just blurred together. I'll look at it again tomorrow when I get back to work. Thanks again.
Re^3: Scraping Rendered Text that is not in Source Code by kcott (Archbishop) on Oct 31, 2010 at 10:39 UTC
I had a look at the example page you gave. Both the HTML and Javascript are buggy. The HTML::TreeBuilder::XPath I mentioned won't be of any use in this situation. I was able to get to the city element with `'//span[@id="city"]'`. The id attributes are supposed to be unique so I'd recommend targeting them directly - that should hopefully get around issues with malformed markup. And it looks like I'm now starting to repeat what Corion already has below, so I'll shut up now. :-) -- Ken
Re: Scraping Rendered Text that is not in Source Code by Gangabass (Vicar) on Oct 31, 2010 at 04:42 UTC
Sorry, but my reply doesn't involve using WWW::Mechanize::Firefox. You can get all the data you want directly from this URL: http://www.acehardware.com/storeLocServ?heavy=true&token=ACE&operation=storeData&storeID=14671&_= I got it using the HttpFox extension in Firefox.
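A rough sketch of fetching that URL without a browser, assuming the core HTTP::Tiny module is acceptable. The helper names `store_data_url` and `fetch_store_data` are made up for illustration, the trailing `_=` is left empty exactly as in the URL above, and since the response format isn't stated in this thread the body is printed unparsed.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: build the data URL quoted above for a store number.
sub store_data_url {
    my ($store_id) = @_;
    return 'http://www.acehardware.com/storeLocServ'
         . '?heavy=true&token=ACE&operation=storeData'
         . "&storeID=$store_id&_=";
}

# Hypothetical helper: fetch the raw response body, or die with the
# HTTP status if the request did not succeed.
sub fetch_store_data {
    my ($store_id) = @_;
    my $response = HTTP::Tiny->new(timeout => 10)->get(store_data_url($store_id));
    die "Request failed: $response->{status} $response->{reason}\n"
        unless $response->{success};
    return $response->{content};
}

# Only touch the network when a store number is given on the command line.
print fetch_store_data($ARGV[0]) if @ARGV;
```

Run it with the store number as the only argument (14671 for the example page) and inspect the output before deciding how to parse it.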
Re^2: Scraping Rendered Text that is not in Source Code by bobross419 (Acolyte) on Oct 31, 2010 at 06:06 UTC
Thanks, I'm open to any suggestions and I'm not stuck on WWW::Mechanize::Firefox. I did try using that same URL string which I was able to find in the .js file (not at work atm so I don't have all the right pages), but I left off the "&_=" at the end. I'll see if I can make that work tomorrow. EDIT: Actually I took a quick look and I think that might just work for my needs. I'll update again tomorrow if this works for me :) Thanks again.