Re: quotes in HTML killing regex

Am using Mechanize and TokeParser to scrape the main parts of the page. When it comes to testing for the icon and the underlying Javascript, these methods fail.

Can you show us the code for that, and explain how it fails for the sample of javascript that you're trying to deal with? When I've used HTML::TokeParser, it has been pretty reliable about recognizing the <script> ... </script> tags, and providing the stuff in between as a single "token". Once you have the token that consists of "script" data (the content between <script> and </script>), it's still up to you to figure out how to do what needs to be done with the script data.
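To make that concrete, here is a minimal sketch of pulling the script data out with get_token. The sample page and the variable names are made up for illustration; the text tokens are concatenated between the start and end tags, so it does not matter whether the parser hands the content over in one piece or several.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Made-up sample page; note the mixed quotes inside the script.
my $html = <<'HTML';
<html><head>
<script type="text/javascript">
var msg = "it's a \"quoted\" string";
</script>
</head><body><p>Hello</p></body></html>
HTML

my $p = HTML::TokeParser->new( \$html ) or die "cannot set up parser";

my $script    = '';
my $in_script = 0;
while ( my $t = $p->get_token ) {
    if    ( $t->[0] eq 'S' and $t->[1] eq 'script' ) { $in_script = 1 }
    elsif ( $t->[0] eq 'E' and $t->[1] eq 'script' ) { $in_script = 0 }
    elsif ( $t->[0] eq 'T' and $in_script )          { $script .= $t->[1] }
}

print $script;   # the raw javascript, quotes and all
```

The quotes never get near a regex here; the parser decides where the script starts and stops, and you only look at $script afterwards.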

Apparently, in the html snippet you are dealing with, the javascript content is an attribute of an anchor tag, which means that when TokeParser returns an array ref to a data structure like this:

  [
    'S',        # indicates a start tag
    'a',        # this is an anchor tag
    $attr,      # hashref to tag attributes (names are lowercased), incl. onclick => "javascript code"
    $attrseq,   # array ref to ordered list of the keys of %$attr
    'full text of anchor tag'
  ]

it's up to you to work over the value of $$attr{onclick} to get whatever information or result you want. At that point, things might still be a bit dicey, but something like this might do:

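use HTML::TokeParser;

my $htm = HTML::TokeParser->new( \$your_html_page );

while ( my $tkn = $htm->get_token ) {
    if ( $$tkn[0] eq 'S' and $$tkn[1] eq 'a'
         and defined $$tkn[2]{onclick}
         and $$tkn[2]{onclick} =~ /^javascript/ ) {
        my ( $url ) = ( $$tkn[2]{onclick} =~ /\("(http:.*?)"\)/ );
        next unless defined $url;   # no quoted URL in this onclick value
        # turn javascript \uXXXX escapes into the characters they stand for
        $url =~ s/\\u([0-9a-f]{4})/chr(hex($1))/gei;
        # do something with $url;
    }
    else {
        # do other stuff with other parts of html data...
    }
}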
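If you want to try the extraction step by itself, without a page or a parser, something like the following sketch might help. The sub name extract_onclick_url and the sample onclick value are hypothetical, invented for this example, and the regex is a bit looser than the one in the loop (it accepts either quote style and https as well), which is an assumption about what your pages contain.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the first quoted http(s) URL out of an
# onclick value and decode any javascript \uXXXX escapes in it.
sub extract_onclick_url {
    my ($onclick) = @_;
    return unless defined $onclick;

    # Capture the quote character, then everything up to the same quote.
    my ( undef, $url ) = $onclick =~ /\(\s*(["'])(https?:.*?)\1/;
    return unless defined $url;

    $url =~ s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge;
    return $url;
}

# Made-up sample value; single quotes keep the \u0026 literal in Perl.
my $sample = 'javascript:openWin("http://example.com/view?id=1\u0026lang=en")';
my $url    = extract_onclick_url($sample);
print defined $url ? "$url\n" : "no url found\n";
```

Keeping this in its own sub means the token loop only has to decide which tags are interesting, and the fiddly regex can be tested on plain strings.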

The code snippet you posted was missing the initial "/" delimiter on the regex in the "if" condition (that's what Anonymous Monk was referring to).
