Re: Scraping Rendered Text that is not in Source Code

If you're having problems with the XPath expressions themselves, the W3C XPath documentation has many examples of XPath syntax and abbreviations.

Another module that seems popular for this type of work is HTML::TreeBuilder::XPath.
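
As a rough sketch of what using that module looks like, here is a minimal example run against an inline HTML string. The markup and the city name are made up for illustration, and this only works for text that is actually present in the page source, not for text filled in later by a script.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup, invented for this example.
my $html = <<'HTML';
<html><body>
  <p>Location: <span id="city">Springfield</span></p>
</body></html>
HTML

# Parse the string into a tree that understands XPath queries.
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findvalue() returns the text content of the node(s) matched by the expression.
my $city = $tree->findvalue('//span[@id="city"]');
print "City: $city\n";

# HTML::TreeBuilder trees should be deleted explicitly when finished.
$tree->delete;
```

If you want the elements themselves rather than their text, `findnodes()` takes the same expression and returns the matching nodes.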

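Also, in your code you have `while (...) {...};`. The semicolon after the closing brace isn't needed: a `while` block is a complete statement on its own, so the extra `;` is just an empty statement (harmless, but best removed).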
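
To illustrate the point about the semicolon, here is a small sketch contrasting the block form of `while` with the `do {...} while (...)` form, which is probably where the habit comes from. The variable names and data are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical data, only for illustration.
my @queue = (1, 2, 3);

# Block form: the closing brace ends the statement, so no semicolon follows it.
my $sum = 0;
while (defined(my $n = shift @queue)) {
    $sum += $n;
}

# Statement-modifier form: 'do BLOCK while EXPR' is an ordinary statement,
# so this one does need its terminating semicolon.
my $count = 0;
do {
    $count++;
} while ($count < 3);

print "sum=$sum count=$count\n";
```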

I also notice the error is reported as "...at script2.pl line 2." but line 2 of the script you posted is `use strict;`, which suggests the error came from a different version of the script than the one shown.

-- Ken


Re^2: Scraping Rendered Text that is not in Source Code by bobross419 (Acolyte) on Oct 31, 2010 at 06:09 UTC
Thanks for the reply. I've never done anything with XPaths before, but I did look into the documentation a little. At this point I was going on hour 7 of what I thought would be an easy script... The while loop was copied straight off of a CPAN example somewhere, but I'll definitely keep that in mind for the future. About the error, I really don't know. I went through so many error messages today that they all just blurred together. I'll look at it again tomorrow when I get back to work. Thanks again.
Re^3: Scraping Rendered Text that is not in Source Code by kcott (Archbishop) on Oct 31, 2010 at 10:39 UTC
I had a look at the example page you gave. Both the HTML and JavaScript are buggy. The HTML::TreeBuilder::XPath module I mentioned won't be of any use in this situation. I was able to get to the city element with `'//span[@id="city"]'`. The id attributes are supposed to be unique, so I'd recommend targeting them directly - that should hopefully get around issues with malformed markup. And it looks like I'm now starting to repeat what Corion already has below, so I'll shut up now. :-)

-- Ken