GaijinPunch has asked for the wisdom of the Perl Monks concerning the following question:

Hey monks... I've not been here in a long while. In fact, I've been cheating on you by using PHP in one of my larger projects. I know, tsk tsk. I do use Perl at work on admittedly rather menial stuff day in day out though. Anyhoo...

Long story short, I'm using cURL in PHP to do the following:
Slurp the page -> find the link to the login page and follow it -> find the login form and submit it -> re-slurp the original page (now logged in).
This worked fine and dandy until a change on the server side, which I've determined now checks for JavaScript and forces a CAPTCHA when it's disabled. If I disable JavaScript at any step of the above sequence in a browser, the CAPTCHA is forced.

I'm using HttpFox to look at the traffic, and the cookies seem to be set in a pretty standard fashion: I get two in response to the first slurp (which uses GET), and in a browser with JavaScript enabled I get another five or so after I submit the login form. The headers being sent are identical in the browser and the cURL script, but cURL of course ignores the JavaScript. I know the magic is set somewhere in either a <script> or a <noscript> tag. With JavaScript enabled I get a very non-trivial script, plus about five cookies; I don't know what it does, but it can't be ignored -- it's 189k and relatively daunting. When I submit the login form with my cURL PHP script (JS obviously unavailable), the scripts are ignored, I get no cookies, and the CAPTCHA appears.

So, I'm now weighing my options.
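For reference, here's roughly what the same flow looks like ported to Perl with WWW::Mechanize. The URL, link text, and form-field names below are placeholders, not the real site's, and this sketch hits exactly the same wall, since plain Mechanize doesn't execute JavaScript either:

    use strict;
    use warnings;
    use WWW::Mechanize;

    # Placeholder URL and credentials -- not the real site's.
    my $start_url = 'http://www.example.com/';

    # Mechanize keeps an in-memory cookie jar by default, so cookies
    # set along the way are sent back on subsequent requests.
    my $mech = WWW::Mechanize->new(
        agent => 'Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0',
    );

    # Slurp the original page.
    $mech->get($start_url);

    # Find the link to the login page and follow it.
    $mech->follow_link( text_regex => qr/log\s*in/i );

    # Find the login form and submit it (field names are guesses).
    $mech->submit_form(
        with_fields => {
            username => 'me',
            password => 'sekrit',
        },
    );

    # Re-slurp the original page, now (in theory) logged in.
    $mech->get($start_url);
    print $mech->content;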

*A long time ago I did something similar with WWW::Mechanize, and I'm not totally opposed to doing that now. I see there's a JavaScript plugin (I think it's relatively new), but I'm curious how it works. Can I be reasonably sure the server side will think I'm a browser with JS enabled if I use the plugin?

*How could the server side be keeping track of whether I'm running JavaScript? My guess is that a script the browser executes hits some URL, and that request marks me as a browser rather than a scraper (see the sketch after this list).

*Any other options I'm not thinking of?
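Regarding the second bullet: if the check really is just a request the script fires at some endpoint, one crude option would be to spot that request in HttpFox and replay it by hand with the same cookie jar before submitting the login form. A sketch, where '/js-check' is a purely hypothetical URL standing in for whatever would have to be dug out of that 189k script:

    # Continuing with $mech from the sketch above; '/js-check' is a
    # made-up endpoint, not anything confirmed about the real site.
    $mech->get('http://www.example.com/js-check');

    # Any cookies that response sets land in the jar automatically and
    # ride along on the login-form submission that follows.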

Cheers for any advice!

Re: Mechanize, Javascript, Cookies, and you!
by Marshall (Canon) on Jun 21, 2011 at 02:33 UTC
    You may need WWW::Mechanize::Firefox. Implementing a JavaScript engine is hard; the idea behind WWW::Mechanize::Firefox is to control Firefox and have it do that part of the job. A cool idea.
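    A minimal sketch of the idea (the URL is a placeholder, and Firefox must already be running with the MozRepl add-on installed and activated):

        use strict;
        use warnings;
        use WWW::Mechanize::Firefox;

        # Connects to a running Firefox through the MozRepl add-on.
        # Firefox executes the JavaScript and manages the cookies itself.
        my $mech = WWW::Mechanize::Firefox->new();

        $mech->get('http://www.example.com/');  # placeholder URL

        # The content you get back is the DOM as Firefox sees it,
        # after any JavaScript has run.
        print $mech->content;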
      Checking into this -- it looks like it would be right up my alley. I'll read through it, but one thing I'm wondering: does it require X? This will eventually run on a remote machine without an X server. :)

      Anyway, cheers -- got something to explore for a while, that's for sure.

        The people who use WWW::Mechanize::Firefox without a display usually seem to run a VNC server and redirect Firefox to it.

        When I started playing with the mozrepl interface, I logged in with PuTTY. Firefox has to be running, but I don't think you'll need to "watch" the screen. Read the docs, install the Firefox add-on, then play a bit to see how it works; this interface is what WWW::Mechanize::Firefox talks to. You might want to Google mozrepl as well. Have fun!
Re: Mechanize, Javascript, Cookies, and you!
by aquarium (Curate) on Jun 21, 2011 at 04:25 UTC
    Maybe JMeter could do it, or something like server-side JavaScript via Rhino... but it seems you're heading down a slippery slope anyway. Make sure you set up your HTTP headers and response handling to look like a browser. But as per my slippery-slope observation, these kinds of systems are usually also rigged with random checks that make the CAPTCHA show up anyway. If someone is watching and doesn't appreciate your automation attempts, they may start popping up more CAPTCHAs based on your IP address, etc., or start popping them up incessantly if they're really against this -- and thus defeat all your coding powers at the press of a button.
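    For the headers part, a minimal sketch with WWW::Mechanize -- the values are examples copied from a typical Firefox, nothing site-specific:

        use strict;
        use warnings;
        use WWW::Mechanize;

        my $mech = WWW::Mechanize->new(
            agent => 'Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0',
        );

        # Default headers sent on every request, to look more like a browser.
        $mech->add_header(
            'Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language' => 'en-us,en;q=0.5',
        );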
    the hardest line to type correctly is: stty erase ^H
      Indeed, advice to be considered. All in all, yes, I think they're against it, but I'm not going to risk pissing off everyone. Without giving away too much: since my scraper translates for people who otherwise wouldn't be customers, it's in their best interest to allow it. However, they've also got to think about malicious scrapers. I'll bet anyone a shiny quarter the beefed-up security is in response to Sony getting sodomized recently.