The website is a horrible mess of frames within frames within frames, with JavaScript adding and changing form values when they're submitted. It's hard to even isolate the documents that populate the frames, because they immediately change and reload when they find they're not in the right frameset.
I can't automate the process of "click here, select that, put value in box" with WWW::Mechanize here because of all the JavaScript. And I don't want to get into using OLE to control the browser; that way lies madness, I suspect.
I decided to short-circuit the process by looking at the headers sent when I submitted the form request, post-JavaScript.
So I look at the headers (from the Firefox LiveHTTPHeaders extension) and they're like this:
POST /cgi/script.pl?TimeStamp=1137799035141&Monitor=W17P&Server_Name=localhost foo=x&bar=y
with a space after the "Server_Name=localhost", which is the first bit I don't understand. I don't know much about the guts of HTTP.
But, soldiering on, I replace that space with "&" and change the POST to a GET and it works, in LWP::Simple.
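For reference, here is a minimal sketch of the GET version described above, built with the URI module rather than by pasting strings together. The host name www.example.com is a placeholder (the real host isn't shown here), and foo/bar stand in for the real form fields.

```perl
use strict;
use warnings;
use URI;

# Placeholder host; the path and parameter names come from the captured headers.
my $uri = URI->new('http://www.example.com/cgi/script.pl');

# Merge the original query-string parameters and the former POST body
# into a single query string, i.e. "replace the space with &".
$uri->query_form(
    TimeStamp   => '1137799035141',
    Monitor     => 'W17P',
    Server_Name => 'localhost',
    foo         => 'x',
    bar         => 'y',
);

print "$uri\n";

# The resulting URL can then be fetched with LWP::Simple:
#   use LWP::Simple qw(get);
#   my $html = get("$uri");
```

Passing the pairs as a list (not a hash) keeps them in the order given, and query_form takes care of any escaping the values might need.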
But it only works for a while. A few hours later the same request fails.
I know it looks as if the TimeStamp value in the query string might expire, but I can run the query without it, so that doesn't appear to be the problem.
I've tried doing it as an HTTP::Request, preserving the POST rather than changing it to a GET, but the web server gives an error, saying "Length required".
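"Length required" is what a server says (HTTP status 411) when a request arrives with a body but no Content-Length header, which is easy to end up with when assembling an HTTP::Request by hand. A minimal sketch of one way around it, assuming the part after the space in the captured request is the form body: let HTTP::Request::Common build the POST, since it sets Content-Type and Content-Length itself. The host is a placeholder, and foo/bar stand in for the real fields.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Placeholder host; the query string stays on the URL, as in the captured request.
my $url = 'http://www.example.com/cgi/script.pl'
        . '?TimeStamp=1137799035141&Monitor=W17P&Server_Name=localhost';

# The form fields go in the request body. POST() URL-encodes them and
# fills in the Content-Type and Content-Length headers.
my $req = POST $url, [ foo => 'x', bar => 'y' ];

# Inspect what would be sent, without touching the network.
print $req->as_string;

# To actually send it:
#   use LWP::UserAgent;
#   my $res = LWP::UserAgent->new->request($req);
#   print $res->status_line, "\n";
```

Printing the request first makes it easy to compare against what LiveHTTPHeaders shows the browser sending.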
The script it's trying to access is a Perl script, and it takes some key-value pairs that might shed some light on it, if someone has seen them before or worked with the database it's accessing:
PACBASEID=(a 26-character alphanumeric string)
MYCURSOR=(6 letters, underscore, 6 digits)
For instance: PACBASEID=0A4000400234502016252400LT&MYCURSOR=XRSDCT_000101
Do these variables perhaps form some sort of session ID? There aren't any cookies being set. There's also a "MY_PFKEY" key where "09" seems to mean "next page", which works when invoked repeatedly, so the server seems to be saving state somehow, i.e. it knows that "next" means records 50-100 the first time you run it with "MY_PFKEY=09", 100-150 the second time, and so on.
Anyway, anything anyone can suggest, any ideas gratefully received. Otherwise it's going to be a very long painful process of tracking JavaScript from frame to frame and document to document, which I thought I'd short-circuited.
($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print
In reply to Scraping a website - various problems by Cody Pendant