I'm trying to scrape (public) information from a website.

The website is a horrible mess of frames within frames within frames, with JavaScript adding and changing form values when the forms are submitted. It's hard to even isolate the documents which populate the frames, because they immediately change and reload when they're not in the right frameset.

I can't automate the process of "click here, select that, put value in box" with WWW::Mechanize here because of all the JavaScript. And I don't want to get into trying to use OLE to control the browser; that way lies madness, I suspect.

I decided to short-circuit the process by looking at the headers sent when I submitted the form request, post-JavaScript.

So I looked at the headers (using the Firefox LiveHTTPHeaders extension), and they're like this:

POST /cgi/script.pl?TimeStamp=1137799035141&Monitor=W17P&Server_Name=localhost foo=x&bar=y

Note the space after "Server_Name=localhost", which is the first bit I don't understand. I don't know much about the guts of HTTP.

But, soldiering on, I replace that space with "&" and change the POST to a GET and it works, in LWP::Simple.
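A minimal sketch of the workaround described above, for anyone who wants to reproduce it. The host www.example.com is a placeholder (the real site isn't named here), and the `--fetch` switch is only an illustrative way to keep the network call optional:

```perl
use strict;
use warnings;

# Request target plus trailing data, as LiveHTTPHeaders displayed it.
my $captured = '/cgi/script.pl?TimeStamp=1137799035141&Monitor=W17P&Server_Name=localhost foo=x&bar=y';

# Placeholder host; substitute the real one.
my $base = 'http://www.example.com';

# Replace the single space with "&" so everything ends up in the query string.
(my $path_and_query = $captured) =~ s/ /&/;
my $get_url = $base . $path_and_query;

print "$get_url\n";

# Only touch the network when asked to: perl script.pl --fetch
if (@ARGV && $ARGV[0] eq '--fetch') {
    require LWP::Simple;
    my $html = LWP::Simple::get($get_url);
    print defined $html ? "Fetched " . length($html) . " bytes\n" : "Request failed\n";
}
```

LWP::Simple's get() returns undef on any failure, so it can't say why the request stops working after a few hours; LWP::UserAgent would at least expose the status line.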

But it only works for a while. A few hours later the same request fails.

I know it looks as if the TimeStamp value in the query string might expire, but I can run the query without it, so that doesn't appear to be the problem.

I've tried doing it as an HTTP::Request, preserving the POST rather than changing it to a GET, but the web server gives an error, saying "Length required".
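One possible reading, offered as a guess rather than a diagnosis: the text after the space in the captured request may be the POST body, not part of the URL, and "Length required" (HTTP status 411) is what a server sends when it insists on a Content-Length header. The sketch below builds such a request with the body sent as content. The host is again a placeholder, the form content type is an assumption, and the `--send` switch is just an illustrative guard:

```perl
use strict;
use warnings;
use HTTP::Request;

# Placeholder host; the query string stays in the URL.
my $url  = 'http://www.example.com/cgi/script.pl?TimeStamp=1137799035141&Monitor=W17P&Server_Name=localhost';

# Assumed to be the POST body (the part after the space in the capture).
my $body = 'foo=x&bar=y';

my $request = HTTP::Request->new(POST => $url);
# Assumption: the form is submitted as ordinary URL-encoded form data.
$request->header('Content-Type'   => 'application/x-www-form-urlencoded');
$request->header('Content-Length' => length $body);
$request->content($body);

# Show exactly what would go over the wire.
print $request->as_string;

# Only send when asked to: perl script.pl --send
if (@ARGV && $ARGV[0] eq '--send') {
    require LWP::UserAgent;
    my $response = LWP::UserAgent->new->request($request);
    print $response->status_line, "\n";
}
```

Printing the request with as_string before sending makes it easy to compare against the LiveHTTPHeaders capture line by line.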

The script I'm trying to access is written in Perl, and it takes some key-value pairs that might shed some light on things, if anyone has seen them before or has worked with the database it's accessing:

PACBASEID= (a 26-char alphanumeric string)

MYCURSOR=(6 letters, underscore, 6 numbers)

For instance: PACBASEID=0A4000400234502016252400LT&MYCURSOR=XRSDCT_000101

Do these variables perhaps form some sort of session ID? There aren't any cookies being set. There's also a "MY_PFKEY" key where "09" seems to mean "next page", and it works when invoked repeatedly, so the server seems to be saving state somehow. That is, it knows that "next" means records 50-100 the first time you run it with "MY_PFKEY=09", records 100-150 the second time, and so on.

Anyway, anything anyone can suggest, any ideas gratefully received. Otherwise it's going to be a very long painful process of tracking JavaScript from frame to frame and document to document, which I thought I'd short-circuited.



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

In reply to Scraping a website - various problems by Cody Pendant
