Cody Pendant has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to scrape (public) information from a website.

The website is a horrible mess of frames within frames within frames, with JavaScript adding and changing form values when they're submitted. It's hard even to isolate the documents that populate the frames, because they immediately reload when they detect they're not in the right frameset.

I can't automate the process of "click here, select that, put a value in the box" with WWW::Mechanize because of all the JavaScript. And I don't want to get into using OLE to control the browser; that way lies madness, I suspect.

I decided to short-circuit the process by looking at the headers sent when I submitted the form request, post-JavaScript.

So I look at the headers (from the Firefox LiveHTTPHeaders extension) and they look like this:

POST /cgi/script.pl?TimeStamp=1137799035141&Monitor=W17P&Server_Name=localhost foo=x&bar=y

with a space after "Server_Name=localhost", which is the first bit I don't understand. I don't know much about the guts of HTTP.

But, soldiering on, I replace that space with "&", change the POST to a GET, and it works in LWP::Simple.
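In case it helps anyone following along, here's a sketch of that replay-as-GET trick. The host name is a placeholder (I've kept the captured parameter names and values), and URI's query_form does the joining-with-"&" for you:

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use URI;

# Hypothetical host; parameters are the ones captured by LiveHTTPHeaders.
# The POST body pairs (foo, bar) are folded into the query string --
# this is the "replace the space with &" step.
my $uri = URI->new('http://example.com/cgi/script.pl');
$uri->query_form(
    TimeStamp   => '1137799035141',
    Monitor     => 'W17P',
    Server_Name => 'localhost',
    foo         => 'x',
    bar         => 'y',
);

# LWP::Simple::get returns undef on failure.
my $content = get($uri);
print $content if defined $content;
```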

But it only works for a while. A few hours later the same request fails.

I know it looks as if the TimeStamp value in the string might expire, but I can run the query without it, so that doesn't appear to be the problem.

I've tried doing it as an HTTP::Request, preserving the POST rather than changing it to a GET, but the web server returns an error saying "Length required".
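For the record, "Length required" generally means the POST went out without a Content-Length header. If you build the request with HTTP::Request::Common's POST(), it encodes the body and sets Content-Length for you — a sketch, with a hypothetical URL:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $ua = LWP::UserAgent->new;

# Hypothetical URL: the query-string parameters stay on the URL, the
# form pairs go in the body. POST() url-encodes the body and sets the
# Content-Length header automatically.
my $req = POST 'http://example.com/cgi/script.pl?TimeStamp=1137799035141&Monitor=W17P&Server_Name=localhost',
    [ foo => 'x', bar => 'y' ];

my $res = $ua->request($req);
if ($res->is_success) {
    print $res->content;
}
else {
    warn $res->status_line, "\n";
}
```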

The script I'm hitting is a Perl CGI, and it takes some key-value pairs that might shed some light on things, if anyone has seen them before or worked with the database it's accessing:

PACBASEID= (a 26-character alphanumeric string)

MYCURSOR=(6 letters, underscore, 6 numbers)

for instance PACBASEID=0A4000400234502016252400LT&MYCURSOR=XRSDCT_000101

Do these variables perhaps form some sort of session ID? There aren't any cookies being set. There's also a "MY_PFKEY" key, where "09" seems to mean "next page", and it works when invoked repeatedly, so the server seems to be saving state somehow: it knows that "next" means records 50-100 the first time you run it with "MY_PFKEY=09", records 100-150 the second time, and so on.
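In case it's useful, the paging behaviour I'm describing could be driven from a loop like this (a sketch only — the URL is a placeholder, and the PACBASEID/MYCURSOR values would have to come from a freshly captured session):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(POST);

my $ua = LWP::UserAgent->new;

# Values captured from a live session (these are the example ones from
# above; a real run would need fresh ones). MY_PFKEY=09 appears to
# mean "next page", with the server tracking position itself.
my %form = (
    PACBASEID => '0A4000400234502016252400LT',
    MYCURSOR  => 'XRSDCT_000101',
    MY_PFKEY  => '09',
);

for my $page (1 .. 10) {
    my $res = $ua->request(
        POST 'http://example.com/cgi/script.pl', [ %form ]
    );
    last unless $res->is_success;
    print "--- page $page ---\n", $res->content;
}
```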

Anyway, anything anyone can suggest, any ideas gratefully received. Otherwise it's going to be a very long painful process of tracking JavaScript from frame to frame and document to document, which I thought I'd short-circuited.



($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
=~y~b-v~a-z~s; print

Replies are listed 'Best First'.
Re: Scraping a website - various problems
by stony (Initiate) on Jan 21, 2006 at 00:45 UTC
    I asked a very similar question yesterday. I was having trouble with the fact that the web page is written such that it only makes sense after a JavaScript engine has parsed it. I was told that LWP was giving me the same answer as File->SaveAs in the browser (not true). After that, I started looking into the JavaScript.pm package, thinking I could run the source of the page through a JavaScript engine myself. I stopped there; it looked very daunting. Plus, in the end, I found a different way of solving the problem....

    Where cleverness fails, use brute force.

    However, as far as the page sometimes failing for no apparent reason....
    I have had that happen. I haven't been able to debug it, but it appears that when I hit a URL thousands of times, it will occasionally puke. It doesn't seem to die permanently, though. I put in a "fail this many times" clause and fixed my problem... (brute force over cleverness)
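    That "fail this many times" clause might look something like this — a sketch, with a hypothetical URL, using the fact that LWP::Simple::get returns undef on failure:

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Generic "fail this many times" wrapper: call $fetch until it returns
# a defined value or we run out of tries.
sub retry {
    my ($fetch, $max_tries) = @_;
    for my $try (1 .. $max_tries) {
        my $result = $fetch->();
        return $result if defined $result;
        warn "attempt $try of $max_tries failed\n";
    }
    return undef;
}

# Hypothetical URL; give the flaky server five chances before giving up.
my $content = retry(sub { get('http://example.com/cgi/script.pl') }, 5);
print $content if defined $content;
```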

    Stony
Re: Scraping a website - various problems
by simonm (Vicar) on Jan 21, 2006 at 04:43 UTC
    Depending on the database, the cursor might be a persistent result set holding the records that matched a given query. If some process on the server clears out inactive cursors, that might well be why the original request works for a while and then stops working.
Re: Scraping a website - various problems
by ptum (Priest) on Jan 21, 2006 at 07:49 UTC

    One thing to remember when dealing with high-traffic websites is that the webserver can actually be an array of webservers. One such website, I'm told, has more than 200 webservers balancing traffic among them, and (when new content is being rolled out) they don't necessarily all have the same version of the web content or services.

    The behavior you describe has the sound of a webserver with locally cached data, which can be expected to be cleared out or expire over time, or differ in the case of an alternate webserver.

    It might be worth your while to see if there are any services deployed which would give you cleaner, more reliable access to the data you need.


    No good deed goes unpunished. -- (attributed to) Oscar Wilde
Re: Scraping a website - various problems
by BrowserUk (Patriarch) on Jan 21, 2006 at 08:00 UTC
    Do these variables perhaps form some sort of session ID?

    The presence of "PACBASEID" suggests that the site is probably using IBM's VA PacBase product. Does that help at all?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      That's a good clue, thanks very much BrowserUK.


      ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
      =~y~b-v~a-z~s; print