Stripping the contents of Javascript tags

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Stripping the contents of Javascript tags by Jenda (Abbot) on May 27, 2003 at 23:14 UTC
What about the JavaScript event handlers? You don't care about `<body onLoad="...">`?!? You'd be better off using an HTML filtering module. HTML::TagFilter, HTML::JFilter or HTML::Filter would be much safer.
Jenda
Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. -- Rick Osborne
Re: Stripping the contents of Javascript tags by sauoq (Abbot) on May 27, 2003 at 23:18 UTC
It'll probably be worthwhile to look carefully at the pages it fails on (and where in each page it fails). I'm just guessing, but it might be that you've found some end tags with whitespace in them. If so, just be a little more liberal in how you match the end tag. This might do it: `$foo =~ s|<\s*script[^>]*>.*?</\s*script\s*>||gis;`
-sauoq "My two cents aren't worth a dime.";
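As a rough check of the more liberal end-tag match suggested above, here is a minimal sketch. The sample markup is made up for illustration, and a regex like this is only a heuristic, not an HTML parser.

```perl
use strict;
use warnings;

# Hypothetical sample input: one ordinary script block and one whose
# end tag uses mixed case and contains trailing whitespace.
my $foo = join '',
    '<p>a</p>',
    '<script type="text/javascript">var x = 1;</script>',
    '<p>b</p>',
    "<SCRIPT>\nfor (i = 0; i < 3; i++) {}\n</Script >",
    '<p>c</p>';

# /g removes every block, /i ignores case, /s lets . match newlines.
# .*? is non-greedy, so each match stops at the nearest end tag.
(my $stripped = $foo) =~ s|<\s*script[^>]*>.*?</\s*script\s*>||gis;

print "$stripped\n";
```

Both blocks are removed here, including the one that a strict `</script>` match would miss.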
Re: Stripping the contents of Javascript tags by Cody Pendant (Prior) on May 28, 2003 at 00:39 UTC
`$foo =~ s|<\s*script[^>]*>.*?</script>||gis;` I don't understand the "\s*" at the start of the regex -- what's that doing there? Any amount of whitespace before the "script"? That would render the HTML invalid, surely? I've never had any problems with `$html =~ s/<script[^>]*>.*?<\/script>//sgi` so maybe there's something strange in your content? Randomly, what if there was some weird custom tag like `<scriptblock>` in there? You might need a `\b` after "script" in that case. (Searching for weird possibilities) What if you had `<script src="if-a->-b-script.js">` or something bizarre? Show us a page on which it fails to perform correctly.
-- “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D
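The two "weird possibilities" above can be tried directly. This is only a sketch with invented sample strings: it compares the pattern with and without `\b` on a `<scriptblock>` tag, then runs the `\b` version on the odd `src` attribute.

```perl
use strict;
use warnings;

# Hypothetical input: a custom <scriptblock> tag followed by a real script.
my $custom = '<scriptblock>keep me</scriptblock><script>drop();</script>';

(my $no_boundary = $custom) =~ s/<script[^>]*>.*?<\/script>//sgi;
(my $boundary    = $custom) =~ s/<script\b[^>]*>.*?<\/script>//sgi;

print 'without \b: ', "[$no_boundary]\n";   # custom tag is swallowed too
print 'with \b:    ', "[$boundary]\n";      # custom tag survives

# Hypothetical input: a ">" inside the src attribute value.
my $odd = '<script src="if-a->-b-script.js">alert(1)</script>tail';
(my $odd_out = $odd) =~ s/<script\b[^>]*>.*?<\/script>//sgi;

print "odd src:    [$odd_out]\n";
```

In the last case `[^>]*` stops early at the `>` inside the attribute, but `.*?` then consumes the rest of the tag along with the script body, so the block is still removed in this particular input.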
Re: Stripping the contents of Javascript tags by Anonymous Monk on May 28, 2003 at 02:18 UTC
And what about a similar construct for stripping comments? I looked at HTML::TagFilter, and it wasn't so good at stripping JavaScript tags, but the regexes sauoq and Cody Pendant came up with both worked. Another weird thing is this Microsoft garbage: `<!--[if IE]><script language=javascript>ie5=1;</script><![endif]-->` Using the regexes suggested, the `<script>..</script>` tags are removed, but the Microsoft garbage remains. Is there a way to roll two more regexes that can strip "normal" comments, and additionally strip this Microsoft comment garbage as well (prominently found on Yahoo's main page)? Thanks for the help.
Re: Re: Stripping the contents of Javascript tags by Lachesis (Friar) on May 28, 2003 at 08:28 UTC
A comment stripper should pick up on the MS garbage as well as normal comments. Personally I'd go for using HTML::Parser to reconstruct the file. A quick and dirty version would be something like: `use HTML::Parser; my $parser = HTML::Parser->new( api_version => 3, default_h => [sub { print $_[0] unless lc $_[1] eq 'script' }, 'text,tagname'], comment_h => [""], ); $parser->parse_file($file);` This won't handle any JavaScript event handlers. For that you would need to go through the attributes of each tag and remove the event handlers. You would also need to check all your hrefs for JavaScript as well. Update: The comment stripping will work, but the default handler I wrote won't remove the script properly.
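The claim that a plain comment stripper also catches the conditional-comment block can be checked with a regex sketch. This is not the HTML::Parser approach described above, just a core-Perl approximation on a made-up fragment, and it shares the usual weaknesses of regex-based HTML handling.

```perl
use strict;
use warnings;

# Hypothetical page fragment: an ordinary comment plus the
# conditional comment quoted earlier in the thread.
my $html = join '',
    '<p>top</p>',
    "<!-- an ordinary\n     multi-line comment -->",
    '<!--[if IE]><script language=javascript>ie5=1;</script><![endif]-->',
    '<p>bottom</p>';

# Strip script blocks first, so a "-->" inside script text cannot end
# a comment match early; then strip whatever comments remain.
(my $clean = $html) =~ s|<\s*script\b[^>]*>.*?</\s*script\s*>||gis;
$clean =~ s/<!--.*?-->//gs;

print "$clean\n";
```

After the script pass the fragment still contains `<!--[if IE]><![endif]-->`, which the second substitution treats like any other comment.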
Re: Stripping the contents of Javascript tags by svsingh (Priest) on May 27, 2003 at 23:03 UTC
You're probably experiencing greed. This means the `.*` in the middle of your match will grab everything between the first `<script>` and the last `</script>` in `$foo`, including other `</script>...<script>` combinations. The following code should give you some ideas. It does assume that there are no other HTML tags between the `<script>` and `</script>` tags. `my $s = qq(<script>chunk1</script> stuff <Script>chunk2</Script>); while ( $s =~ m|<script>([^<]+)<\/script>|gi ) { print "match: $1\n"; }` Also, I don't know if this was intentional or not, but you're doing a substitution by starting your expression with an `s`. That will replace anything you match in `$foo` with nothing (effectively removing it from the scalar). I hope this is what you were looking for.
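For reference, the difference between a greedy `.*` and a non-greedy `.*?` can be seen on the same sample string. This is a small sketch for illustration only, and the variable names are invented here.

```perl
use strict;
use warnings;

my $s = qq(<script>chunk1</script> stuff <Script>chunk2</Script>);

# Greedy: .* runs to the LAST end tag, taking " stuff " with it.
(my $greedy = $s) =~ s|<script>.*</script>||gi;

# Non-greedy: .*? stops at the NEAREST end tag, so each block is
# removed separately and the text between them is kept.
(my $lazy = $s) =~ s|<script>.*?</script>||gi;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
```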
Re: Re: Stripping the contents of Javascript tags by Cody Pendant (Prior) on May 28, 2003 at 00:26 UTC
"It does assume that there are no other HTML tags between the `<script>` and `</script>` tags." Er, no it doesn't -- it assumes there are no `<` characters inside the script tags, and as those are also JavaScript's less-than signs, that's a fatally flawed assumption. Any JavaScript for-loop will kill that regex.
-- “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D
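A quick way to see the objection is to run the negated-class pattern against a script body that contains a less-than comparison. The sample string below is hypothetical, and the `.*?` pattern is shown only for comparison.

```perl
use strict;
use warnings;

# Hypothetical script body containing a less-than comparison.
my $s = '<script>for (i = 0; i < 3; i++) { go(i); }</script>';

# [^<]+ cannot cross the "<" in "i < 3", so this finds nothing.
my @negated = $s =~ m|<script>([^<]+)</script>|gi;

# .*? with /s can cross it and stops at the nearest end tag.
my @lazy = $s =~ m|<script>(.*?)</script>|gis;

printf "negated class: %d match(es)\n", scalar @negated;
printf "lazy dot:      %d match(es)\n", scalar @lazy;
```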
Re: Re: Stripping the contents of Javascript tags by sauoq (Abbot) on May 27, 2003 at 23:20 UTC
"This means the `.*` in the middle of your match. . ." Uhm... that `.*` looks an awful lot like a `.*?` to me. ;-)
-sauoq "My two cents aren't worth a dime.";
Re: Re: Stripping the contents of Javascript tags by svsingh (Priest) on May 28, 2003 at 03:26 UTC
Ack! Sorry about that. I guess a little knowledge really can be dangerous. Thanks for letting me know where I went wrong.