rduke15 has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I need to parse an HTML page, but the page seems to be almost completely generated by silly javascript calls to document.write(). (Incidentally, it's my bank's login page, and I wonder if they consider this as some sort of added security. Fortunately they have some real security too.)

I'm surely not the first one who would like to convert HTML with JavaScript into plain HTML, yet I cannot find a script that does it, or even a web page talking about it. CPAN has 2 JavaScript modules, but it's not very clear to me how I could make them do what I want.

Would someone have come across something like that?

Thanks.

Replies are listed 'Best First'.
Re: javascript to html to perl
by Adrade (Pilgrim) on May 10, 2005 at 21:53 UTC
    I would do something like this... let's pretend that your document is in $data

    print join('',($data =~ m/document\.write\(['"]([^)]+)['"]/sg));
    This would print everything enclosed within document.write() throughout the doc.

    Hope this helps,
      -Adam

      Yes, that was my first approach. But it is very fragile. To begin with, there are plenty of escaped double-quotes. These wouldn't be hard to take care of, but it will inevitably continue with stuff breaking my regexes.

      I don't want to end up trying to write a javascript parser in Perl, when there are excellent javascript interpreters already in the browsers. There must be a way to make them work for my Perl script instead of the browser window.

        Well, the regex above will still work with escaped quotes... it matches up to the last quote mark before the first closing parenthesis, so all escaped quotes get included.

          -A
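A slightly sturdier variant of the regex idea discussed above, as a minimal sketch only. It assumes each document.write() call takes a single quoted string literal; string concatenation, variables and document.writeln() are not handled, and the unescaping is deliberately simplistic. The helper name extract_document_writes is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the string literals passed to document.write().
# Assumes one quoted literal per call; anything more complex is skipped.
sub extract_document_writes {
    my ($data) = @_;
    my @chunks;
    while ($data =~ m{
            document\.write\( \s*
            (["'])                        # opening quote, remembered as \1
            ( (?: \\. | (?!\1)[^\\] )* )  # an escaped char, or any char except the quote or a backslash
            \1                            # the matching closing quote
            \s* \)
        }gsx) {
        my $literal = $2;
        $literal =~ s/\\(.)/$1/gs;        # simplistic unescape: drop the backslash, keep the next char
        push @chunks, $literal;
    }
    return join '', @chunks;
}

my $data = q{<script>document.write("<a href=\"x.html\">go (now)</a>");</script>};
print extract_document_writes($data), "\n";
```

Unlike [^)]+, this pattern does not stop at a ")" inside the written HTML, but it is still pattern matching and not a JavaScript interpreter, so the fragility concern raised above remains.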
Re: javascript to html to perl
by scmason (Monk) on May 10, 2005 at 22:44 UTC
    I am assuming that you want the data after the javascript writes have been performed (not just the raw html doc without the writes performed). If so, here is a trick:

    ELinks is a text only browser with some javascript support. You might want to script it to download and save the rendered page, then run your script across that.

    You might consider the same thing with Mozilla. Allow it to perform all of the JavaScript writes, and then scrape that. With a little poking around you will find that Mozilla can be pretty easy to work with (look at JRex as a possible start). The idea here is: let a JavaScript rendering engine perform all the writes, then scrape the results.
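    A rough sketch of the ELinks idea from Perl, under some loud assumptions: an elinks binary is on the PATH, it was built with JavaScript support, and that support actually executes the page's document.write() calls in dump mode (not verified here). The -dump option prints the formatted page as text. The helper names elinks_command and dump_page and the example URL are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build the command line as a list so no shell quoting is needed.
sub elinks_command {
    my ($url) = @_;
    return ('elinks', '-dump', $url);
}

# Hypothetical helper: run elinks and return whatever text it printed.
sub dump_page {
    my ($url) = @_;
    open my $fh, '-|', elinks_command($url)
        or die "Cannot run elinks: $!";
    local $/;                      # slurp the whole output at once
    my $text = <$fh>;
    close $fh or warn "elinks exited with status $?\n";
    return $text;
}

# Example (placeholder URL):
# print dump_page('http://www.example.com/login.html');
```

    The dumped output is rendered text, not HTML, so it suits scraping visible content more than parsing markup.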

      Elinks sounds good, but for this project, I would like it to run in Win32.

      JRex allows embedding a browser into Java. I guess I'm after a mini-PRex, which would embed a mini (text-only) browser into Perl.

      Will I have to wait for Perl 6?... :-)

        I guess I'm after a mini-PRex, which would embed a mini (text-only) browser into Perl

        Or you could make one yourself. Should only take a loooooooooong time, and we all have time, right?

        It is important to note how EASY it is to actually write a Mozilla-based app. Please see the openly licensed book Rapid Application Development with Mozilla. You might find that Perl is not needed at this stage, and that simply throwing together some Mozilla components does what you want.

        I would like it (ELinks) to run in Win32.

        Is cygwin out of the question? It may compile there. If so, then how about this:

        I know that Visual Basic is a curse word, but I believe in using the best tool for the job: you could easily embed an IE browser control in a VB app that does the rendering and saving you need. It is fairly easy to use and fairly well documented.

Re: javascript to html to perl
by davidrw (Prior) on May 11, 2005 at 01:56 UTC
    This follows the same idea as the ELinks/Mozilla suggestion of getting an existing browser to do the JavaScript work for you, and you stated you wanted this on Win32. I haven't done it myself, but I know that Win32::OLE can be used to control Internet Explorer, so you could access the rendered page after the JavaScript has run through that. The only thing I've personally done is:
    use Win32::OLE;
    my $ie = Win32::OLE->new('InternetExplorer.Application');
    $ie->Navigate($url);
    $ie->{Visible} = 1;
    I didn't need anything more than the above, so I haven't delved into these yet, but they looked very promising:
    perl.com: Automating Windows Applications with Win32::OLE
    Simple Automation Module For Internet Explorer
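    Extending the Win32::OLE idea to actually read back the rendered page might look roughly like the sketch below. It is untested here and rests on assumptions: it relies on the Busy and ReadyState properties of the InternetExplorer.Application object to detect that loading has finished, and on documentElement.outerHTML of the loaded document to get the markup after the scripts have run. The helper name rendered_html and the 30-second default timeout are made up for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);

use constant READYSTATE_COMPLETE => 4;   # IE's "fully loaded" ready state

# Hypothetical helper: navigate, poll until the browser reports it is done,
# then return the document's HTML as the browser sees it.
sub rendered_html {
    my ($ie, $url, $timeout) = @_;
    $timeout = 30 unless defined $timeout;   # seconds; arbitrary default
    $ie->Navigate($url);
    my $waited = 0;
    while ($ie->{Busy} || $ie->{ReadyState} != READYSTATE_COMPLETE) {
        die "Timed out loading $url\n" if $waited >= $timeout;
        sleep 0.25;
        $waited += 0.25;
    }
    return $ie->{Document}{documentElement}{outerHTML};
}

# Only attempt the real thing on Windows, and only when a URL is given.
if ($^O eq 'MSWin32' && @ARGV) {
    require Win32::OLE;
    my $ie = Win32::OLE->new('InternetExplorer.Application')
        or die "Cannot start Internet Explorer: " . Win32::OLE->LastError();
    print rendered_html($ie, $ARGV[0]);
    $ie->Quit;
}
```

    The resulting string can then be handed to whatever HTML parser the script already uses. Scripts that keep writing after the load event would need a longer wait than this polling loop provides.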
Re: javascript to html to perl
by Joost (Canon) on May 10, 2005 at 22:01 UTC