Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I've got a webpage in a scalar, $foo, and I want to go through it and strip everything between the <script type="text/javascript"> starting tag and the ending </script> tag. So far I've got this small regex, which works, but on some pages it seems to slurp a larger part of the page than I want.
$foo =~ s|<\s*script[^>]*>.*?</script>||gis;

Can someone help me shore this up a bit?

Re: Stripping the contents of Javascript tags
by Jenda (Abbot) on May 27, 2003 at 23:14 UTC

    What about the JavaScript event handlers? You don't care about <body onLoad="...">?!?

    You'd better use some HTML filtering module. Using HTML::TagFilter, HTML::JFilter or HTML::Filter would be much safer.

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

Re: Stripping the contents of Javascript tags
by sauoq (Abbot) on May 27, 2003 at 23:18 UTC

    It'll probably be worthwhile to look carefully at the pages it fails on (and at where in each page it fails).

    I'm just guessing, but it might be that you've found some end tags with whitespace in them... if so, just be a little more liberal in how you match the end tag.

    This might do it:

    $foo =~ s|<\s*script[^>]*>.*?</\s*script\s*>||gis;
    -sauoq
    "My two cents aren't worth a dime.";
    
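To see why a stray space in an end tag could explain the "slurping", here is a minimal sketch. The sample page is made up for illustration; only the two patterns come from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample: the first end tag contains whitespace ("</script >").
my $page = qq{<p>one</p><script type="text/javascript">a();</script >}
         . qq{<p>two</p><script>b();</script><p>three</p>};

# Original pattern: the end tag must be exactly "</script>".
(my $strict = $page) =~ s|<\s*script[^>]*>.*?</script>||gis;

# More liberal pattern: optional whitespace inside the end tag.
(my $liberal = $page) =~ s|<\s*script[^>]*>.*?</\s*script\s*>||gis;

print "strict:  $strict\n";
print "liberal: $liberal\n";
```

With the strict pattern the first match runs on to the next exact "</script>", so "<p>two</p>" disappears along with both scripts. The liberal pattern removes each script block separately.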
Re: Stripping the contents of Javascript tags
by Cody Pendant (Prior) on May 28, 2003 at 00:39 UTC
     $foo =~ s|<\s*script[^>]*>.*?</script>||gis;

    I don't understand the "\s*" at the start of the regex -- what's that doing there? Any amount of whitespace before the "script"? That would render the HTML invalid, surely?

    I've never had any problems with

    $html =~ s/<script[^>]*>.*?<\/script>//sgi;

    so maybe there's something strange in your content?

    Randomly, what if there was some weird custom tag like <scriptblock> in there? You might need a \b after "script" in that case.

    (Searching for weird possibilities) What if you had  <script src="if-a->-b-script.js"> or something bizarre?

    Show us a page on which it fails to perform correctly.
    --

    “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
    M-J D
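The two "weird possibilities" above can be tried out on made-up input. This is only a sketch: the <scriptblock> element and the sample strings are hypothetical, and the patterns are the one quoted in the post plus the suggested \b.

```perl
use strict;
use warnings;

# Hypothetical input with a made-up <scriptblock> element.
my $html = '<scriptblock>keep me</scriptblock><script>drop();</script><p>text</p>';

# Without \b, "<script[^>]*>" also matches "<scriptblock>".
(my $loose = $html) =~ s/<script[^>]*>.*?<\/script>//sgi;

# With \b, "script" has to be the whole tag name.
(my $bounded = $html) =~ s/<script\b[^>]*>.*?<\/script>//sgi;

# A ">" inside an attribute value: "[^>]*>" stops early at that ">",
# but the lazy ".*?" still runs on to the end tag.
my $odd = '<script src="if-a->-b-script.js"></script><p>after</p>';
(my $odd_out = $odd) =~ s/<script\b[^>]*>.*?<\/script>//sgi;

print "loose:   $loose\n";
print "bounded: $bounded\n";
print "odd:     $odd_out\n";
```

On this input the pattern without \b starts matching at "<scriptblock>" and swallows it, while the \b version leaves it alone. The odd src attribute does not break the substitution in this particular sample, though other inputs may behave differently.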
Re: Stripping the contents of Javascript tags
by Anonymous Monk on May 28, 2003 at 02:18 UTC
    And what about a similar construct for stripping comments? I looked at HTML::TagFilter, but it wasn't very good at stripping JavaScript tags, whereas the regexes from sauoq and Cody Pendant both worked. Another weird thing is this Microsoft garbage:
    <!--[if IE]><script language=javascript>ie5=1;</script><![endif]-->

    Using the regexes suggested, the <script>..</script> tags are removed, but the Microsoft garbage remains. Is there a way to roll two more regexes, one that strips "normal" comments and one that strips this Microsoft comment garbage as well (found prominently on Yahoo's main page)?

    Thanks for the help.
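For what it's worth, a single lazy comment regex may already cover both cases, because the conditional block still starts with "<!--" and ends with "-->". A minimal sketch, assuming a made-up fragment around the line quoted above:

```perl
use strict;
use warnings;

# Hypothetical fragment: one ordinary comment plus the conditional comment from the post.
my $page = '<p>a</p><!-- plain comment -->'
         . '<!--[if IE]><script language=javascript>ie5=1;</script><![endif]-->'
         . '<p>b</p>';

# Remove everything from "<!--" to the nearest "-->"; /s lets "." cross newlines.
(my $clean = $page) =~ s/<!--.*?-->//gs;

print "$clean\n";
```

Like any regex over HTML this is fragile. For example, it would also cut at a "-->" that happens to appear inside script code or an attribute value.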

      A comment stripper should pick up the MS garbage as well as normal comments.
      Personally I'd go for using HTML::Parser to reconstruct the file. A quick and dirty version would be something like:
      use HTML::Parser;

      my $parser = HTML::Parser->new(
          api_version => 3,
          default_h   => [ sub { print $_[0] unless lc $_[1] eq 'script' }, 'text,tagname' ],
          comment_h   => [""],
      );
      $parser->parse_file($file);
      This won't handle any JavaScript event handlers; for that you would need to go through the attributes of each tag and remove the event handlers.
      You would also need to check all your hrefs for javascript: URLs.
      Update: The comment stripping will work but the default handler I wrote won't remove the script properly.
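One possible way around the limitation mentioned in the update is HTML::Parser's ignore_elements method, which suppresses the start tag, the end tag and everything in between for the named elements. The sketch below assumes HTML::Parser 3.x is installed; the function name strip_scripts_and_comments and the sample input are made up for illustration. It still does nothing about event-handler attributes or javascript: hrefs.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns $html without <script> elements and comments.
sub strip_scripts_and_comments {
    my ($html) = @_;
    my $out = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Copy the original source text of every event we do not handle specially.
        default_h   => [ sub { $out .= $_[0] }, 'text' ],
        # An empty handler drops comments, including <!--[if IE]> ... <![endif]--> blocks.
        comment_h   => [ '' ],
    );

    # Suppress the <script> start tag, its content and its end tag.
    $parser->ignore_elements('script');

    $parser->parse($html);
    $parser->eof;
    return $out;
}

my $sample = '<p>a</p><!-- c --><script type="text/javascript">if (a < b) { x(); }</script><p>b</p>';
print strip_scripts_and_comments($sample), "\n";
```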
Re: Stripping the contents of Javascript tags
by svsingh (Priest) on May 27, 2003 at 23:03 UTC
    You're probably experiencing greed. This means the .* in the middle of your match will grab everything between the first <script> and last </script> in $foo, including other </script>...<script> combinations.

    The following code should give you some ideas. It does assume that there are no other HTML tags between the <script> and </script> tags.

    $s = qq(<script>chunk1</script> stuff <Script>chunk2</Script>);
    while ( $s =~ m|<script>([^<]+)<\/script>|gi ) {
        print "match: $1\n";
    }

    Also, I don't know if this was intentional or not, but you're also doing a substitution by starting your expression with an s. That will replace anything you match in $foo with nothing (effectively removing it from the scalar).

    I hope this is what you were looking for.

      It does assume that there are no other HTML tags between the <script> and </script> tags.

      Er, no it doesn't -- it assumes there are no HTML open-brackets inside the script tags, and as they are also the less-than signs, that's a fatally flawed assumption. Any JavaScript for-loop will kill that regex.
      --

      “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.”
      M-J D
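The objection about less-than signs is easy to check. A small sketch with a made-up script body containing a for-loop, comparing the [^<]+ capture with a lazy .*? capture:

```perl
use strict;
use warnings;

# Hypothetical sample: the second script body contains a less-than sign.
my $s = '<script>chunk1</script> stuff <script>for (i = 0; i < 3; i++) { f(i); }</script>';

# [^<]+ cannot cross the "<" in "i < 3", so the second block is never matched.
my @no_lt;
while ( $s =~ m|<script>([^<]+)</script>|gi ) {
    push @no_lt, $1;
}

# A lazy .*? (with /s so "." also matches newlines) captures both bodies.
my @lazy;
while ( $s =~ m|<script>(.*?)</script>|gis ) {
    push @lazy, $1;
}

print scalar(@no_lt), " match(es) with [^<]+\n";
print scalar(@lazy),  " match(es) with .*?\n";
```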
      This means the .* in the middle of your match. . .

      Uhm... that .* looks an awful lot like a .*? to me. ;-)

      -sauoq
      "My two cents aren't worth a dime.";
      
      Ack! Sorry about that. I guess a little knowledge really can be dangerous. Thanks for letting me know where I went wrong.
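For anyone else unsure about the difference between .* and .*? that came up above, here is a minimal sketch on a made-up string:

```perl
use strict;
use warnings;

# Hypothetical sample with two script blocks and some text in between.
my $s = '<script>one</script> keep <script>two</script>';

# Greedy: .* runs to the LAST </script>, taking " keep " with it.
(my $greedy = $s) =~ s|<script>.*</script>||gis;

# Non-greedy: .*? stops at the NEAREST </script>, so each block goes separately.
(my $lazy = $s) =~ s|<script>.*?</script>||gis;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```

The original poster's regex already used the non-greedy form, so greed alone would not explain the over-matching.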