Re: Scraping Rendered Text that is not in Source Code

Your script finds the element you seem to be looking for once I fix the bad XPath query in line 21:

//dataAddress2[id="city"]

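would be searching for an HTML tag dataAddress2, which does not exist on that page (nor anywhere else). Also, without the @, the predicate [id="city"] tests for a child element named id rather than for the id attribute.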

As you are searching for an element with an id attribute anyway, and id attributes are (supposed to be) unique across the page, the following XPath expression extracts the element for me (provided I've unblocked the crappy JavaScript on all those pages in NoScript):

//*[@id="city"]

For finding what elements I've captured, I like to print ->{innerHTML}:

print "..." . $mech->xpath('//*[@id="city"]', one => 1)->{innerHTML};

It seems that the JavaScript gets triggered after some time without another event, and the element just gets filled in instead of actually appearing, so you might need to wait in a loop and watch the element content change from &nbsp; to the content you actually want.
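
A rough sketch of such a wait loop, in case it helps. The helper names wait_for_content and is_placeholder are made up for this post (they are not part of WWW::Mechanize::Firefox), and the timeout and polling interval defaults are guesses you will want to tune. The helper only needs a code reference that returns the current content, so it does not care where that content comes from:

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper: calls $fetch repeatedly until it returns something
# other than a placeholder, or until the timeout has passed.
# Returns the content, or undef if we gave up.
sub wait_for_content {
    my ($fetch, %opt) = @_;
    my $timeout  = $opt{timeout}  // 10;     # seconds, assumed default
    my $interval = $opt{interval} // 0.5;    # seconds between polls, assumed default
    my $deadline = time() + $timeout;

    while (1) {
        my $content = $fetch->();
        return $content if defined $content && !is_placeholder($content);
        return if time() >= $deadline;       # timed out: undef in scalar context
        sleep $interval;
    }
}

# Treats the empty string, whitespace, the &nbsp; entity and the literal
# non-breaking space character (U+00A0) as "not filled in yet".
sub is_placeholder {
    my ($text) = @_;
    return $text =~ /\A(?:\s|&nbsp;|\x{A0})*\z/ ? 1 : 0;
}

# Usage with the $mech object from your script (needs a running Firefox):
# my $city = wait_for_content(
#     sub { $mech->xpath('//*[@id="city"]', one => 1)->{innerHTML} },
#     timeout => 15,
# );
# print defined $city ? "city: $city\n" : "gave up waiting\n";
```

The timeout is there so that the script does not hang forever if the JavaScript never fires. Whether innerHTML comes back as the &nbsp; entity or as the raw character may depend on the page, which is why the check accepts both.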


Replies are listed 'Best First'.

Re^2: Scraping Rendered Text that is not in Source Code
by bobross419 (Acolyte) on Oct 31, 2010 at 19:58 UTC

Thanks corion, the ->{innerHTML} is the part that's been missing for me.

In one of many previous attempts, I did have the correct XPath syntax, but because I didn't know about ->{innerHTML} it just returned a Hash code. I am now off and running on the right path.

I solved the wait problem by modifying the while loop to check whether the content of the element with ID "city" still equals &nbsp;.

Again, thank you so much, and thanks to the rest of the guys/gals that offered some information to help me along the way.
