Parsing content found in onClick and window.open Javascript calls

by hacker (Priest)
on Sep 02, 2007 at 14:51 UTC

hacker has asked for the wisdom of the Perl Monks concerning the following question:

As many of you know, I do a lot of screen-scraping as part of my projects.

The best way to test a spider/screen-scraper written in Perl (or any language) before pointing it at production content is to run it against... pr0n sites. No, seriously!

  1. They love the traffic
  2. There are tons of links to images, popups, broken HTML, and so on.
  3. A well-behaved web spider would barely be a blip on their radar.

But back on track. In some of the non-pr0n content (a big news site) I'm trying to scrape, there are links to sub-pages whose content I need, and those links are hidden inside onClick() and window.open calls in JavaScript. You click a news article title, a window pops up, and the content itself is in that secondary window.

I tried HTML::SimpleLinkExtor and friends to extract the links that point to those popup windows, but that module doesn't treat a remote URL inside a tag as an href.

Here's a simplified example of what I'm trying to parse:

<td align="center" valign="middle"><a href = javascript:void(0) onmouseover="window.status='This is my news article'; return true;" onmouseout="window.status=''; return true;" onClick="window.open ('http://news.example.com/article0234/', 'News','alwaysRaised=1, toolbar=0, scrollbars=0, location=0, statusbar=0, menubar=0, resizable=0, width=620, height=400');" >New Link 0234</a></td>

In this code, clicking on "New Link 0234" on the main page pops up a window that points to 'http://news.example.com/article0234/', and that popup window contains the content I need to scrape.

Has anyone tried to do this? I can do it with some really ugly regexes and grep(), but I'd prefer a cleaner option.

Replies are listed 'Best First'.
Re: Parsing content found in onClick and window.open Javascript calls
by moritz (Cardinal) on Sep 02, 2007 at 15:06 UTC
Re: Parsing content found in onClick and window.open Javascript calls
by Your Mother (Archbishop) on Sep 03, 2007 at 03:23 UTC

    Nice tips. :)

    Treat it as plain text and use URI::Find?

    Don't forget that one of the tricks they use is to make the JS hard to see, so that filters/blockers won't catch them trying to put your browser into a circle-jerk. So you might, if you're being *thorough*, have to do something crazy like this (very unrefined):

    my $esc = qr/[\\'," ]/;
    m,w(?:$esc)*i(?:$esc)*n(?:$esc)*d(?:$esc)*o(?:$esc)*w(?:$esc)*\.(?:$esc)*o(?:$esc)*p(?:$esc)*e(?:$esc)*n(?:$esc)*\. ET CETERA,;
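    The same idea can be generated rather than typed out by hand. What follows is only a rough sketch under assumptions: `loose_pattern` is a hypothetical helper name, the sample strings are made up for illustration, and it matches nothing beyond the literal `window.open` with the `$esc` characters allowed between letters.

```perl
use strict;
use warnings;

# Characters an obfuscated script might sprinkle between letters
# (the same character class as $esc above).
my $esc = qr/[\\'," ]/;

# Hypothetical helper: build a regex that matches $literal with any
# number of $esc characters allowed between consecutive characters.
sub loose_pattern {
    my ($literal) = @_;
    my $body = join "(?:$esc)*", map { quotemeta($_) } split //, $literal;
    return qr/$body/;
}

my $window_open = loose_pattern('window.open');

# Made-up samples: a plain call, a call assembled from string
# fragments, and an attribute that never calls window.open.
my @samples = (
    q{window.open ('http://news.example.com/article0234/', 'News');},
    q{['win','dow.o','pen'].join('')},
    q{window.status=''; return true;},
);

for my $js (@samples) {
    printf "%-5s %s\n", ($js =~ $window_open ? 'match' : 'no'), $js;
}
```

    Building the pattern from a string keeps the character class in one place, so widening `$esc` (for example to allow `+` between concatenated fragments) is a one-line change.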
