Re: How to scrape an HTTPS website that has JavaScript
by jeffa (Bishop) on Aug 23, 2004 at 16:08 UTC
Sometimes you can't do what you want, but sometimes you can grab the JavaScript code and run it through the JavaScript module. I have successfully done this, but it was a specialized case. Don't forget to install the necessary JavaScript libraries as prescribed in the docs for that module, by the way.
jeffa, if you had any trivial sample code to do this, it would be great to see it. I've failed to do this before...
Here ya go: it's old and the demo link is down because i am ... still being lazy. :D Still, it was fun while it lasted.
(jeffa) Re: Encrypt web files!
Re: How to scrape an HTTPS website that has JavaScript
by tomhukins (Curate) on Aug 23, 2004 at 16:31 UTC
tomhukins,
You have successfully gotten HTTP::Recorder to do JavaScript? You might want to let the author know, since the docs for even the latest development release state it won't record JavaScript actions.
It was a few months ago, but I recall successfully using HTTP::Recorder (which is built on HTTP::Proxy) with JavaScript-enabled sites. HTTP::Proxy traps all HTTP requests, regardless of whether they come from plain HTML hyperlinks or are initiated by JavaScript. Granted, this only works when the JavaScript does something simple, but in my experience it usually does. If the JavaScript does anything complex, then you're right: you'll have to either run a JavaScript interpreter within Perl or rewrite the algorithm in Perl.
I did let the author know. She said that it kind of depends on what your JavaScript does. However, that matters more when you are trying to use HTTP::Recorder to generate QA scripts that test your JavaScript. When scraping a site, I don't think there is anything JavaScript could do that would not be captured by an HTTP proxy.
Re: How to scrape an HTTPS website that has JavaScript
by Happy-the-monk (Canon) on Aug 23, 2004 at 16:11 UTC
jeffa has pointed out one way to go.
Other times, HTTP links buried in JavaScript are just HTTP links. You can follow those using WWW::Mechanize, though you might need a browser to figure out the URLs first.
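A minimal sketch of that approach, assuming the URLs sit in simple window.open()/location.href handlers. The helper name, the pattern, and the page URL are all made up for illustration, not from this thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Heuristically pull URL strings out of inline JavaScript, e.g. from
# onclick="window.open('/report.html')" handlers. This is a rough
# pattern match, not a JavaScript parser.
sub js_urls {
    my ($html) = @_;
    return $html =~ /(?:window\.open|location\.href\s*=)\s*\(?\s*['"]([^'"]+)['"]/g;
}

# Live usage (guarded so the sketch runs without network access):
if ( $ENV{RUN_LIVE} ) {
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new;
    $mech->get('https://example.com/');    # hypothetical page
    for my $url ( js_urls( $mech->content ) ) {
        print "following $url\n";
        $mech->get($url);                  # follow it like any plain link
        $mech->back;
    }
}
```

Once you have the bare URL, Mechanize doesn't care that it was originally wrapped in JavaScript.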
Cheers, Sören
Re: How to scrape an HTTPS website that has JavaScript
by dmorgo (Pilgrim) on Aug 23, 2004 at 19:27 UTC
I haven't used it, but I thought the module JavaScript::SpiderMonkey was made for this kind of thing. Does anyone have experience with this module?
As explained here, that's only one small part of the problem.
Re: How to scrape an HTTPS website that has JavaScript
by saintmike (Vicar) on Aug 23, 2004 at 17:46 UTC
Re: How to scrape an HTTPS website that has JavaScript
by elwarren (Priest) on Aug 24, 2004 at 18:18 UTC
I once needed to work with JavaScript. I just parsed the HTML, grepped for the variable initialization I needed, and took the value with a regex. No need to execute the code just to concatenate a couple of strings.
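For instance, a sketch of that grep-and-regex approach. The variable names and the HTML here are hypothetical, purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Grab the value assigned to a JavaScript variable straight out of the
# HTML, without executing anything.
sub js_var {
    my ( $html, $name ) = @_;
    my ($val) = $html =~ /var\s+\Q$name\E\s*=\s*['"]([^'"]*)['"]/;
    return $val;
}

my $html = <<'HTML';
<script>
var sessionToken = "abc123";
var nextPage = "/page2.html";
</script>
HTML

print js_var( $html, 'sessionToken' ), "\n";    # prints "abc123"
```

This obviously breaks down as soon as the value is computed rather than assigned as a literal, but for simple string concatenation you can often just capture the pieces and join them yourself.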
As for the other part of your request regarding an HTTPS request, I've had mixed results. I have been unable to get HTTPS working through our proxy/firewall at work, though I can go through the proxy via plain HTTP without a problem. A bit of googling leads to a link dated 2001 saying that libwww and Crypt::SSLeay almost work together, but don't.
Here's a short example I've used to illustrate the problem. It fails for me on ActiveState 5.8.4 on XP. (*nix is not an option as this is my work machine, nor is cygwin)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::UserAgent;
#http://groups.yahoo.com/group/libwww-perl/message/7242
$|=1;
my @hosts = ('http://login.yahoo.com', 'https://login.yahoo.com');
my $ua=LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 ");
my $https_proxy=$ENV{https_proxy};
delete $ENV{https_proxy} if ($https_proxy);
$ua->env_proxy;
$ENV{https_proxy}=$https_proxy if ($https_proxy);
foreach (@hosts) {
my $req = HTTP::Request->new(GET => $_);
#$req->proxy_authorization_basic($ENV{HTTP_PROXY_USER}, $ENV{HTTP_PROXY_PASS});
my $res = $ua->request($req);
if ($res->is_success) { print $res->status_line, "\nsomething\n"; }
else { print $res->status_line, "\nnothing\n"; }
}
Woohoo! It works now :-) After digging this old problem out of the dead projects folder, I couldn't leave it alone. It seems that Crypt::SSLeay uses HTTPS_PROXY_USERNAME while LWP uses HTTP_PROXY_USER. In my testing, the HTTPS_PROXY environment setting still needs to be deleted and then set again. Changing the env proxy block of my code to this works now:
my $https_proxy=$ENV{HTTPS_PROXY};
delete $ENV{HTTPS_PROXY} if ($https_proxy);
$ua->env_proxy;
$ENV{HTTPS_PROXY}=$https_proxy if ($https_proxy);
$ENV{HTTPS_PROXY_USERNAME}=$ENV{HTTP_PROXY_USER};
$ENV{HTTPS_PROXY_PASSWORD}=$ENV{HTTP_PROXY_PASS};
Yay! Now I have to go write the rest of what I'd set out to do in the first place. Another dead project lives again!