Re: How to scrape an HTTPS website that has JavaScript
by jeffa (Bishop) on Aug 23, 2004 at 16:08 UTC
Sometimes you can't do what you want, but sometimes you can grab the JavaScript code and run it through the JavaScript module. I have successfully done this, but it was a specialized case. Don't forget to install the necessary JavaScript libraries as prescribed in the docs for that module, by the way.
jeffa, if you had any trivial sample code to do this, it would be great to see it. I've failed to do this before...
Here ya go: it's old and the demo link is down because i am ... still being lazy. :D Still, it was fun while it lasted.
(jeffa) Re: Encrypt web files!
Re: How to scrape an HTTPS website that has JavaScript
by tomhukins (Curate) on Aug 23, 2004 at 16:31 UTC
tomhukins,
You have successfully gotten HTTP::Recorder to do JavaScript? You might want to let the author know, since the docs for even the latest development release state it won't record JavaScript actions.
It was a few months ago, but I recall successfully using HTTP::Recorder (which is built on HTTP::Proxy) with JavaScript-enabled sites. HTTP::Proxy traps all HTTP requests, regardless of whether they come from plain HTML hyperlinks or are initiated by JavaScript. Granted, this only works when the JavaScript does something simple, but in my experience it usually does. If the JavaScript does anything complex, then you're right: you'll have to either run a JavaScript interpreter within Perl or rewrite the algorithm in Perl.
I did let the author know. She said that it kind of depends on what your JavaScript does. However, that matters more when you are trying to use HTTP::Recorder to generate QA scripts that test your JavaScript. When scraping a site, I don't think there is anything JavaScript could do that would not be captured by an HTTP proxy.
Re: How to scrape an HTTPS website that has JavaScript
by Happy-the-monk (Canon) on Aug 23, 2004 at 16:11 UTC
jeffa has pointed out one way to go.
Other times, HTTP links buried in JavaScript are just HTTP links. You can follow those using WWW::Mechanize, though you might need a browser to figure out the URLs first.
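A minimal sketch of that approach, assuming the URLs sit in simple window.open()/location.href handlers. The helper name, the pattern, and the page URL are all made up for illustration, not from this thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Heuristically pull URL strings out of inline JavaScript, e.g. from
# onclick="window.open('/report.html')" handlers. This is a rough
# pattern match, not a JavaScript parser.
sub js_urls {
    my ($html) = @_;
    return $html =~ /(?:window\.open|location\.href\s*=)\s*\(?\s*['"]([^'"]+)['"]/g;
}

# Live usage (guarded so the sketch runs without network access):
if ( $ENV{RUN_LIVE} ) {
    require WWW::Mechanize;
    my $mech = WWW::Mechanize->new;
    $mech->get('https://example.com/');    # hypothetical page
    for my $url ( js_urls( $mech->content ) ) {
        print "following $url\n";
        $mech->get($url);                  # follow it like any plain link
        $mech->back;
    }
}
```

Once you have the bare URL, Mechanize doesn't care that it was originally wrapped in JavaScript.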
Cheers, Sören
Re: How to scrape an HTTPS website that has JavaScript
by dmorgo (Pilgrim) on Aug 23, 2004 at 19:27 UTC
I haven't used it, but I thought the module JavaScript::SpiderMonkey was made for this kind of thing. Does anyone have experience with this module?
As explained here, that's only one small part of the problem.
Re: How to scrape an HTTPS website that has JavaScript
by saintmike (Vicar) on Aug 23, 2004 at 17:46 UTC
Re: How to scrape an HTTPS website that has JavaScript
by elwarren (Priest) on Aug 24, 2004 at 18:18 UTC
I once needed to work with JavaScript. I just parsed the HTML, grepped for the variable initialization I needed, and took the value with a regex. No need to execute the code just to concatenate a couple of strings.
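For instance, a sketch of that grep-and-regex approach. The variable names and the HTML here are hypothetical, purely for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Grab the value assigned to a JavaScript variable straight out of the
# HTML, without executing anything.
sub js_var {
    my ( $html, $name ) = @_;
    my ($val) = $html =~ /var\s+\Q$name\E\s*=\s*['"]([^'"]*)['"]/;
    return $val;
}

my $html = <<'HTML';
<script>
var sessionToken = "abc123";
var nextPage = "/page2.html";
</script>
HTML

print js_var( $html, 'sessionToken' ), "\n";    # prints "abc123"
```

This obviously breaks down as soon as the value is computed rather than assigned as a literal, but for simple string concatenation you can often just capture the pieces and join them yourself.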
As for the other part of your request regarding an HTTPS request, I've had mixed results. I have been unable to get HTTPS working through our proxy/firewall at work, though I can go through the proxy via plain HTTP without a problem. A bit of googling leads to a link dated 2001 saying that libwww and Crypt::SSLeay almost work together, but don't.
Here's a short example I've used to illustrate the problem. It fails for me on ActiveState 5.8.4 on XP. (*nix is not an option as this is my work machine, nor is cygwin)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::UserAgent;
#http://groups.yahoo.com/group/libwww-perl/message/7242
$|=1;
my @hosts = ('http://login.yahoo.com', 'https://login.yahoo.com');
my $ua=LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 ");
my $https_proxy=$ENV{https_proxy};
delete $ENV{https_proxy} if ($https_proxy);
$ua->env_proxy;
$ENV{https_proxy}=$https_proxy if ($https_proxy);
foreach (@hosts) {
my $req = HTTP::Request->new(GET => $_);
#$req->proxy_authorization_basic($ENV{HTTP_PROXY_USER}, $ENV{HTTP_PROXY_PASS});
my $res = $ua->request($req);
if ($res->is_success) { print $res->status_line, "\nsomething\n"; }
else { print $res->status_line, "\nnothing\n"; }
}
Woohoo! It works now :-) After digging this old problem out of the dead projects folder, I couldn't leave it alone. It seems that Crypt::SSLeay uses HTTPS_PROXY_USERNAME while LWP uses HTTP_PROXY_USER. In my testing, the HTTPS_PROXY environment setting still needs to be deleted and then set again. Changing the env proxy block of my code to this works now:
my $https_proxy=$ENV{HTTPS_PROXY};
delete $ENV{HTTPS_PROXY} if ($https_proxy);
$ua->env_proxy;
$ENV{HTTPS_PROXY}=$https_proxy if ($https_proxy);
$ENV{HTTPS_PROXY_USERNAME}=$ENV{HTTP_PROXY_USER};
$ENV{HTTPS_PROXY_PASSWORD}=$ENV{HTTP_PROXY_PASS};
Yay! Now I have to go write the rest of what I'd set out to do in the first place. Another dead project lives again!