jhoskins98 has asked for the wisdom of the Perl Monks concerning the following question:

I have searched and I seem to be striking out. I have a web page (well, a bunch of them) that I am scraping to pull together the permissions from a SharePoint site (WSS, actually). Am using Mechanize and TokeParser to scrape the main parts of the page. When it comes to testing for the icon and the underlying Javascript, these methods fail. So I found a section of HTML where the forward arrow would be located, between two tags with unique classes. My plan was to grab the area between the tags, then test for the link and extract the http address. Source HTML:
.........<TD align=center Class="ms-vb" id="bottomPagingCell"><table><td nowrap class="ms-paging">1&nbsp;-&nbsp;100</td><td> <A HREF="javascript:" OnClick='javascript:SubmitFormPost("http:\u002f\u002fwdcbeta.sharepointsite.net\u002f_layouts\u002fpeople.aspx?Paged=TRUE\u0026p_FSObjType=0\u0026p_Title=Michl\u002520Non\u0026p_ID=119\u0026View=\u00257b4B73D499\u00252dCD8D\u00252d4DE4\u00252dABB5\u00252d1911503FA95C\u00257d\u0026FilterField1=ContentType\u0026FilterValue1=Person\u0026MembershipGroupId=0\u0026PageFirstRow=101");javascript:return false;'><img src="/_layouts/1033/images/next.gif" border=0 alt="Next"></A></td></tr></table></TD></TR> <TR><TD class="ms-bottompagingline3">.......
The code:
    my $stream2 = $mech1->{content};
    (my $stream_chunk) = $stream2 =~ /bottomPagingCell(.*)bottom/;   # <---- fails here
    if (/javascript:SubmitFormPost\(\"(.*)\"\)/) {
        mech1->get($1);
    }
    else {
        # no new page
    }
It fails to find the first regex match. Seems like the quotes in the javascript is killing the match. Any suggestions?

Update: It is failing to find the section of HTML, part of which is above.

Replies are listed 'Best First'.
Re: quotes in HTML killing regex
by graff (Chancellor) on Nov 01, 2008 at 03:38 UTC
    Am using Mechanize and TokeParser to scrape the main parts of the page. When it comes to testing for the icon and the underlying Javascript, these methods fail.

    Can you show us the code for that, and explain how it fails for the sample of javascript that you're trying to deal with? When I've used HTML::TokeParser, it has been pretty clear about recognizing the <script> ... </script> tags and providing the stuff in between as a single "token". Once you have the token that consists of "script" data (the content between <script> and </script>), it's still up to you to figure out how to do what needs to be done with the script data.

    Apparently, in the html snippet you are dealing with, the javascript content is an attribute of an anchor tag, which means that when TokeParser returns an array ref to a data structure like this:

        [ 'S',        # indicates a start tag
          'a',        # this is an anchor tag
          $attr,      # hashref to tag attributes, incl. onclick => "javascript code"
          $attrseq,   # array ref to ordered list of %$attr keys
          'full text of anchor tag'
        ]

    it's up to you to work over the value of $$attr{onclick} to get whatever information or result you want. At that point, things might still be a bit dicey, but something like this might do:
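
        use HTML::TokeParser;

        my $htm = HTML::TokeParser->new( \$your_html_page );
        while ( my $tkn = $htm->get_token ) {
            # start tag, anchor, with an onclick attribute that begins with "javascript"
            if ( $$tkn[0] eq 'S' and $$tkn[1] eq 'a'
                 and ( $$tkn[2]{onclick} || '' ) =~ /^javascript/ ) {
                my ( $url ) = ( $$tkn[2]{onclick} =~ /\("(http:.*?)"\)/ );
                next unless defined $url;    # no quoted http: argument found
                # turn \uXXXX escapes back into the characters they stand for
                $url =~ s/\\u([0-9a-f]{4})/chr(hex($1))/ge;
                # do something with $url;
            }
            else {
                # do other stuff with other parts of html data...
            }
        }
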
    The code snippet you posted was missing the initial "/" delimiter on the regex in the "if" condition (that's what Anonymous Monk was referring to).
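
    The unescaping step in the snippet above can be tried in isolation, without fetching or parsing a page. The following is only a sketch: the onclick value is a shortened, made-up stand-in for the one in the posted HTML (the host example.invalid is an assumption, not the real site), and it uses core Perl only.

```perl
use strict;
use warnings;

# Shortened, hypothetical onclick value in the same shape as the posted HTML.
# Single quotes keep each \uXXXX sequence as literal text.
my $onclick = 'javascript:SubmitFormPost("http:\u002f\u002fexample.invalid\u002f_layouts\u002fpeople.aspx?Paged=TRUE\u0026PageFirstRow=101");javascript:return false;';

# Capture the quoted argument of SubmitFormPost(...), non-greedily.
my ($url) = $onclick =~ /SubmitFormPost\("(.*?)"\)/;

if ( defined $url ) {
    # Replace each \uXXXX escape with the character it names.
    $url =~ s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge;
    print "$url\n";
}
```

    Sequences such as \u002520 in the real page decode to %20, which is ordinary URL encoding and can stay as it is in the address.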
Re: quotes in HTML killing regex
by AnomalousMonk (Archbishop) on Nov 01, 2008 at 16:51 UTC
    I don't fully understand what you are trying to accomplish — but that won't keep me from putting in my two cents!

    One thing strikes me about the regex and the example HTML text you have provided: the regex
        /bottomPagingCell(.*)bottom/
    seems to be trying to match text in the example that seems (depending on how the OP text was rendered and/or mis-typod) to include at least one newline.

    The . (dot) regex metacharacter does not normally match newlines. Use the //s regex switch to enable the 'dot-matches-all' (or, more specifically, dot-matches-newline) mode of regex match.

    See the section on Modifiers in perlre for more info.
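
    A small sketch of the difference, assuming the page content really does contain a newline between the two markers. The two-line string is a made-up stand-in for the fetched page, not the actual SharePoint output.

```perl
use strict;
use warnings;

# Hypothetical two-line fragment standing in for the fetched page content.
my $content = qq{<TD id="bottomPagingCell"><table>\n<td class="ms-paging">1 - 100</td></table></TD><TD class="ms-bottompagingline3">};

# Without /s the dot stops at the newline, so the capture cannot reach "bottom".
my ($without_s) = $content =~ /bottomPagingCell(.*)bottom/;

# With /s the dot also matches newlines, so the capture can span lines.
my ($with_s) = $content =~ /bottomPagingCell(.*)bottom/s;

print defined $without_s ? "without /s: matched\n" : "without /s: no match\n";
print defined $with_s    ? "with /s: matched\n"    : "with /s: no match\n";
```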

Re: quotes in HTML killing regex
by Anonymous Monk on Nov 01, 2008 at 02:27 UTC
    It fails to find the first regex match. Seems like the quotes in the javascript is killing the match. Any suggestions?
    Convert this to valid perl
    if (javascript:SubmitFormPost\(\"(.*)\"\)/) mech1->get($1); } else { # no new page }
      Thanks - serves me right for retyping something rather than cutting and pasting it for this post.
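
      For reference, one syntactically valid form of that conditional is sketched below. It also binds the match to $stream_chunk explicitly, because a bare /.../ in an if tests $_ rather than the captured chunk. The chunk here is a shortened, hypothetical sample (example.invalid is not the real host), and $next_url is a name introduced only for illustration.

```perl
use strict;
use warnings;

# Hypothetical chunk, as if already captured from the page content.
my $stream_chunk = q{<A HREF="javascript:" OnClick='javascript:SubmitFormPost("http:\u002f\u002fexample.invalid\u002fpeople.aspx");javascript:return false;'>};

my $next_url;
# Bind the match to $stream_chunk; a bare /.../ would be tested against $_.
if ( $stream_chunk =~ /javascript:SubmitFormPost\("(.*?)"\)/ ) {
    $next_url = $1;    # still contains the \uXXXX escapes at this point
}
else {
    # no new page
}

print defined $next_url ? "next: $next_url\n" : "no new page\n";
```

      In the real script the captured value would still need its \uXXXX escapes decoded before being handed to $mech1->get(...).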