Am using Mechanize and TokeParser to scrape the main parts of the page. When it comes to testing for the icon and the underlying Javascript, these methods fail.

Can you show us the code for that, and explain how it fails for the sample of JavaScript that you're trying to deal with? When I've used HTML::TokeParser, it has been pretty clear about recognizing the <script> ... </script> tags and providing the stuff in between as a single "token". Once you have the token that consists of "script" data (the content between <script> and </script>), it's still up to you to figure out how to do what needs to be done with the script data.

Apparently, in the html snippet you are dealing with, the javascript content is an attribute of an anchor tag, which means that when TokeParser returns an array ref to a data structure like this:

    [ 'S',        # indicates a start tag
      'a',        # this is an anchor tag
      $attr,      # hashref to tag attributes, incl. onclick => "javascript code"
      $attrseq,   # array ref to ordered list of %$attr keys
      'full text of anchor tag'
    ]
it's up to you to work over the value of $$attr{onclick} to get whatever information or result you want. At that point, things might still be a bit dicey, but something like this might do:
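    use HTML::TokeParser;

    # $your_html_page holds the page source as a string
    my $htm = HTML::TokeParser->new( \$your_html_page );
    while ( my $tkn = $htm->get_token ) {
        if ( $$tkn[0] eq 'S' and $$tkn[1] eq 'a'
             and defined $$tkn[2]{onclick}          # not every anchor has one
             and $$tkn[2]{onclick} =~ /^javascript/ ) {
            my ( $url ) = ( $$tkn[2]{onclick} =~ /\("(http:.*?)"\)/ );
            next unless defined $url;               # no quoted http URL found
            # turn \uXXXX escapes back into literal characters
            $url =~ s/\\u([0-9a-f]{4})/chr(hex($1))/ge;
            # do something with $url;
        }
        else {
            # do other stuff with other parts of html data...
        }
    }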
The code snippet you posted was missing the initial "/" delimiter on the regex in the "if" condition (that's what Anonymous Monk was referring to).
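
If it helps, the regex part can be tried on its own, without any HTML parsing, by feeding it an onclick value directly. This is only a rough sketch: the helper name extract_onclick_url, the JavaScript function name "popup" and the sample URL are all made up for illustration, and your real attribute values may well need a different pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): pull the first quoted
# http URL out of an onclick value and decode any \uXXXX escapes.
sub extract_onclick_url {
    my ($onclick) = @_;
    return unless defined $onclick and $onclick =~ /^javascript/;
    my ($url) = $onclick =~ /\("(http:.*?)"\)/;
    return unless defined $url;
    $url =~ s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge;
    return $url;
}

# Invented sample value, as it might look once TokeParser has handed
# you the attribute hash. Single quotes keep the \u0026 literal.
my $onclick = 'javascript:popup("http://example.com/a\u0026b=1")';
print extract_onclick_url($onclick), "\n";
```

With that sample value the script prints http://example.com/a&b=1, since \u0026 is the escape for "&". Keeping the extraction in a small sub like this makes it easy to test the regex against a few real onclick strings before wiring it into the get_token loop.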

In reply to Re: quotes in HTML killing regex by graff
in thread quotes in HTML killing regex by jhoskins98
