Parsing content found in onClick and window.open Javascript calls

hacker has asked for the wisdom of the Perl Monks concerning the following question:

As many of you know, I do a lot of screen-scraping as part of my projects.

The best way to test a spider/screen-scraper written in Perl (or any language) before pointing it at production content is to run it against... pr0n sites. No, seriously!

They love the traffic.
There are tons of links to images, popups, broken HTML, and so on.
A well-behaved web spider would barely be a blip on their radar.

But back on track: in some of the non-pr0n content (a big news site) I'm trying to scrape, there are links to sub-pages that I need content from, and those links are hidden inside onClick() and window.open calls via JavaScript. You click a news article title, a window pops up, and the content itself is in that secondary window.

I tried HTML::SimpleLinkExtor and friends to extract the links that point to those popup windows, but that module doesn't treat a remote URL buried in a tag's onClick attribute as an href.

Here's a simplified example of what I'm trying to parse:

<td align="center" valign="middle"><a href = javascript:void(0)
  onmouseover="window.status='This is my news article'; return true;"
  onmouseout="window.status=''; return true;"
  onClick="window.open ('http://news.example.com/article0234/', 'News','alwaysRaised=1, toolbar=0, scrollbars=0, location=0, statusbar=0, menubar=0, resizable=0, width=620, height=400');" >News Link 0234</a></td>

In this code, clicking on "News Link 0234" on the main page will pop up a window that points to 'http://news.example.com/article0234/', and that popup window contains the content I need to scrape.

Has anyone tried to do this? I can do it with some really ugly regexes and grep(), but I'd prefer a cleaner option.


Replies are listed 'Best First'.
Re: Parsing content found in onClick and window.open Javascript calls by moritz (Cardinal) on Sep 02, 2007 at 15:06 UTC
You could grep for it with Regexp::Common: `use Regexp::Common qw/URI/; if (m/window\.open\s*\(\s*["']($RE{URI}{HTTP})["']/) { print $1; }`
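If running a regex over the whole page feels too loose, one possible middle ground is to let an HTML parser find the onClick attributes first and only then pattern-match inside them. The following is a minimal sketch, assuming HTML::TokeParser (from the HTML-Parser distribution) is installed; the helper name `popup_urls` and the shortened sample markup are made up for illustration, and it assumes the URL is the first quoted argument of window.open.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: return the first argument of every window.open()
# call found in the onclick attribute of an <a> tag.
sub popup_urls {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML";
    my @urls;
    while (my $tag = $parser->get_tag('a')) {
        # HTML::TokeParser lowercases attribute names, so onClick => onclick.
        my $onclick = $tag->[1]{onclick};
        next unless defined $onclick;
        # Grab the first quoted argument, allowing whitespace before "(".
        while ($onclick =~ /window\.open\s*\(\s*(["'])(.*?)\1/g) {
            push @urls, $2;
        }
    }
    return @urls;
}

# Shortened version of the sample link from the question.
my $html = q{<td><a href = javascript:void(0) onClick="window.open ('http://news.example.com/article0234/', 'News', 'width=620, height=400');">News Link 0234</a></td>};

print "$_\n" for popup_urls($html);
```

This still relies on a small regex, but it only ever sees the contents of one attribute, so stray window.open text elsewhere on the page (comments, inline scripts) is ignored.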
Re: Parsing content found in onClick and window.open Javascript calls by Your Mother (Archbishop) on Sep 03, 2007 at 03:23 UTC
Nice tips. :) Treat it as plain text and use URI::Find? Don't forget that one of the tricks they use is to make the JS hard to see so that filters/blockers won't catch them trying to put your browser into a circle-jerk. So you might, if you're being thorough, have to do something crazy like this (very unrefined): `my $esc = qr/[\\'," ]/; m,w(?:$esc)*i(?:$esc)*n(?:$esc)*d(?:$esc)*o(?:$esc)*w(?:$esc)*\.(?:$esc)*o(?:$esc)*p(?:$esc)*e(?:$esc)*n(?:$esc)*\. ET CETERA,;`
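The plain-text route could look roughly like this. It is only a sketch, assuming URI::Find is installed; the helper name `find_http_urls` and the shortened sample text are invented for illustration. The callback passed to `URI::Find->new` receives a URI object and the original matched text, and must return the text to put back, so returning the original leaves the input untouched.

```perl
use strict;
use warnings;
use URI::Find;

# Hypothetical helper: collect every http/https URL that appears anywhere
# in a chunk of text, without parsing the HTML or the JavaScript at all.
sub find_http_urls {
    my ($text) = @_;
    my @found;
    my $finder = URI::Find->new(sub {
        my ($uri, $original) = @_;
        my $scheme = $uri->scheme;
        # Skip things like javascript: pseudo-URLs if the finder reports them.
        push @found, "$uri" if defined $scheme && $scheme =~ /^https?$/;
        return $original;    # put the matched text back unchanged
    });
    $finder->find(\$text);
    return @found;
}

# Shortened version of the onClick handler from the question.
my $text = q{onClick="window.open ('http://news.example.com/article0234/', 'News', 'width=620');"};

print "$_\n" for find_http_urls($text);
```

Because this ignores structure entirely, it will also pick up ordinary hrefs and image URLs when run over a full page, so some filtering by host or path is probably needed afterwards.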

