Removing Javascript

Mur has asked for the wisdom of the Perl Monks concerning the following question:

Re: Removing Javascript
by Ovid (Cardinal) on Jan 02, 2003 at 21:03 UTC

HTML::TokeParser::Simple to the rescue. You said you wanted to remove everything "between" the tags, so I'm leaving the tags in. This should be relatively easy to fix if you want to also strip the script tags.

#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple 1.4;

my $parser = HTML::TokeParser::Simple->new( *DATA );

my $html      = '';
my $is_script = 0;

while ( my $token = $parser->get_token ) {
  $html .= $token->as_is unless $is_script;
  if ( $token->is_start_tag('script') ) {
    $is_script = 1;
  }
  elsif ( $token->is_end_tag('script') ) {
    $is_script = 0;
    $html .= $token->as_is;
  }
}
print $html;

__DATA__
<title>foobar</title>
<script language="Javascript">
foo bar foo bar
</script>
You fail if you remove this line!
<script language="Javascript">
bar foo bar foo
</script>

Cheers,
Ovid

New address of my CGI Course.
Silence is Evil (feel free to copy and distribute widely - note copyright text)


Re: Removing Javascript
by cLive ;-) (Prior) on Jan 03, 2003 at 01:02 UTC

Without knowing the context in which you want to strip it (security?), it's hard to suggest the best method.

What I would do, if possible, is work the other way round. Define a list of HTML tags/attributes that are valid and strip out everything else.
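
A minimal sketch of that allow-list idea, assuming HTML::TokeParser is available. The `%allowed` table, the `filter_html` and `escape_attr` names, and the `href` scheme check are all made up for illustration and are not from any module; a real filter would need a policy tuned to the actual input.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical allow-list: tag name => attributes permitted on that tag.
my %allowed = (
    p => [],
    b => [],
    i => [],
    a => ['href'],
);

# Elements whose text content is dropped along with the tags.
my %drop_content = map { $_ => 1 } qw(script style);

# Re-encode an attribute value (the parser hands it over entity-decoded).
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/"/&quot;/g;
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    return $value;
}

sub filter_html {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html);
    my ($out, $dropping) = ('', 0);

    while (my $tok = $parser->get_token) {
        my ($type, $tag) = @{$tok};
        if ($type eq 'S') {
            $dropping++ if $drop_content{$tag};
            next unless $allowed{$tag};
            my $attr = $tok->[2];
            # Rebuild the tag from scratch so unlisted attributes vanish.
            $out .= "<$tag";
            for my $name (@{ $allowed{$tag} }) {
                next unless defined $attr->{$name};
                # Only plain http(s), site-relative, or fragment links survive.
                next if $name eq 'href'
                    && $attr->{$name} !~ m{^(?:https?://|/|#)}i;
                $out .= sprintf ' %s="%s"', $name, escape_attr($attr->{$name});
            }
            $out .= '>';
        }
        elsif ($type eq 'E') {
            $dropping-- if $drop_content{$tag} && $dropping;
            $out .= "</$tag>" if $allowed{$tag};
        }
        elsif ($type eq 'T') {
            $out .= $tok->[1] unless $dropping;
        }
        # Comments, declarations and processing instructions are dropped.
    }
    return $out;
}
```

Calling `filter_html($untrusted)` returns only the listed tags and attributes; event handlers such as `onclick` disappear because nothing outside the list is ever copied to the output.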

.02

cLive ;-)


Re: Re: Removing Javascript

by Jenda (Abbot) on Jan 03, 2003 at 14:15 UTC

You are right. HTML::TagFilter will help with this.

Jenda

P.S.: I have something similar here. I had planned to release it as HTML::TagFilter once I was satisfied with the code, but William was quicker :-)


Re: Removing Javascript
by joe++ (Friar) on Jan 02, 2003 at 20:22 UTC

Mur

The problem here is - as always - the behaviour of some web browsers that happily render even the crappiest HTML/JavaScript code when they really shouldn't.

BTW, I'm interpreting your second code example as being JavaScript rather than PHP (which it looks more like), so it would read like this:

<script language="Javascript">
 document.write("Always use </script>!");
</script>

HTML Tidy

LibXML's xmllint

Then, if your source can be corrected this way and converted into well-formed XHTML, you can even think of applying a very minimal XSLT stylesheet which leaves out all the <script/> elements.

(Yes, I know, this has nothing to do with Perl, but this is how I have solved this kind of problem many times already.)
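
For anyone who would rather stay in Perl for that last step, here is a rough sketch that does the same job as such a stylesheet with XML::LibXML instead of XSLT. It assumes the input is already well-formed XHTML (for example after a tidy pass); `strip_script_elements` is a made-up helper name.

```perl
use strict;
use warnings;
use XML::LibXML;

# Remove every <script> element from a well-formed XHTML string.
sub strip_script_elements {
    my ($xhtml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xhtml);
    # local-name() matches <script> with or without the XHTML namespace.
    for my $node ($doc->findnodes('//*[local-name() = "script"]')) {
        $node->unbindNode;    # detach the element and its content
    }
    return $doc->toString;
}
```

Because a real XML parser does the work, a `</script>` that appears as escaped text inside the element cannot end it early; the catch is that input which is not well-formed makes `load_xml` die.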

--
Cheers, Joe


Re: Removing Javascript
by Ionizor (Pilgrim) on Jan 02, 2003 at 20:25 UTC

It sounds like you're looking for a regex. 99% of the time slicing up HTML with regexes is the wrong thing to do. I would try looking into HTML::TokeParser or similar modules to do this.

If you're using XHTML you can use an XML parser such as XML::Simple or XML::Parser but that may be overkill.


Re: Removing Javascript
by chromatic (Archbishop) on Jan 02, 2003 at 20:27 UTC

Sounds like you need an HTML Parser. Hmm, HTML::TokeParser?


Re: Removing Javascript
by Mur (Pilgrim) on Jan 02, 2003 at 20:36 UTC

#!/usr/bin/perl -w
use strict;
use English;
use warnings;
use HTML::Parser;

{
  package JavascriptIsBad;

  use base 'HTML::Parser';

  my $result;
  my $skipping = 0;

  sub start {
    my($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if (lc($tagname) eq 'script') {
      $skipping = 1;
    }
    $result .= $origtext unless $skipping;
  }

  sub end {
    my($self, $tagname, $origtext) = @_;
    $result .= $origtext unless $skipping;
    if (lc($tagname) eq 'script' and $skipping) {
      $skipping = 0;
    }
  }

  sub text {
    my($self, $origtext, $is_cdata) = @_;
    return if $skipping;
    $result .= $origtext;
    return;
  }

  sub result { $result }
}

my $p = JavascriptIsBad->new;
$p->parse(<<EOF);
<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>
EOF
$p->eof;    # flush any text the parser is still buffering

print 'Result: ', $p->result, "\n";

Jeff Boes

Database Engineer

Nexcerpt, Inc.

vox 269.226.9550 ext 24

fax 269.349.9076

http://www.nexcerpt.com

...Nexcerpt...Connecting People With Expertise


Re: Re: Removing Javascript

by Mur (Pilgrim) on Jan 02, 2003 at 20:57 UTC

#!/usr/bin/perl -w
use strict;
use English;
use warnings;
use HTML::TokeParser;

my $doc = <<EOF;
<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>
EOF

my $p = HTML::TokeParser->new(\$doc);

my $result;
my $skipping = 0;
while (my $tok = $p->get_token) {
  if ($tok->[0] eq 'S') {
    if (lc($tok->[1]) eq 'script') {
      $skipping = 1;
    } elsif (!$skipping) {
      $result .= $tok->[-1];
      $result .= $p->get_text;
    }
  } elsif ($tok->[0] eq 'E') {
    if (lc($tok->[1]) eq 'script') {
      $skipping = 0;
    } elsif (!$skipping) {
      $result .= $tok->[-1];
      $result .= $p->get_text;
    }
  } elsif (!$skipping) {
    $result .= $tok->[-1];
    $result .= $p->get_text;
  }
}

print 'Result: ', $result, "\n";

Re: Re: Re: Removing Javascript

by Jenda (Abbot) on Jan 02, 2003 at 23:44 UTC

I see. So you delete everything from the <script> (inclusive) up to the first non-<script> tag (exclusive) that follows a </script> tag. Clever. But whether this helps or not I really don't know.

It will strip any text that follows the </script> (which may not matter if they only have <script> in the <head>), but cases like these probably do matter:

<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
document.write("Even the <body> tag is important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>

<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
if (x<y) { alert("y > x") }
</script>
</head>
<body>
This is just some text.
</body>
</html>

You'd have to parse the JavaScript (at least to some extent) to be able to say whether the </script> is meant to close it or not.

Actually I guess you'd only have to distinguish three states inside the JavaScript: "inside a single-quoted string", "inside a double-quoted string" and "elsewhere". And you'd only treat the </script> as the closing tag in the "elsewhere" state.
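
Something along these lines might do as a first cut of that three-state scan. It is only a sketch: `find_script_end` and `strip_scripts` are made-up names, the opening tag is located with a naive regex, and JavaScript comments or regex literals that contain quotes are not handled.

```perl
use strict;
use warnings;

# Return the offset of the first </script> that sits outside a quoted
# JavaScript string, scanning $src from offset $start; -1 if none.
sub find_script_end {
    my ($src, $start) = @_;
    my $state = 'elsewhere';
    pos($src) = $start || 0;
    while (pos($src) < length $src) {
        if ($state eq 'elsewhere') {
            return $-[0] if $src =~ m{\G</script\s*>}gci;
            if    ($src =~ /\G'/gc) { $state = 'single' }
            elsif ($src =~ /\G"/gc) { $state = 'double' }
            else                    { $src =~ /\G./gcs }    # any other character
        }
        else {
            my $quote = $state eq 'single' ? "'" : '"';
            if    ($src =~ /\G\\./gcs)       { }                    # escaped character
            elsif ($src =~ /\G\Q$quote\E/gc) { $state = 'elsewhere' }
            else                             { $src =~ /\G./gcs }
        }
    }
    return -1;
}

# Cut every <script>...</script> element, tags included, out of $html.
sub strip_scripts {
    my ($html) = @_;
    my ($out, $pos) = ('', 0);
    while ($html =~ /<script\b[^>]*>/gi) {
        my ($open_start, $open_end) = ($-[0], $+[0]);
        $out .= substr($html, $pos, $open_start - $pos);
        my $close = find_script_end($html, $open_end);
        return $out if $close < 0;    # unterminated script: drop the rest
        pos($html) = $close;
        $html =~ m{\G</script\s*>}gci;    # step over the real closing tag
        $pos = pos($html);
    }
    return $out . substr($html, $pos);
}
```

With the second example above, the `</script>` inside the double-quoted string is skipped and only the one on its own line ends the element.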

Jenda


Re: Re: Re: Removing Javascript

by osama (Scribe) on Jan 03, 2003 at 20:52 UTC

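my $result = '';
my $skip   = 0;
while (my $tok = $p->get_token) {
  my ($ttype, $tag) = @{ $tok };
  # Only S and E tokens carry a tag name in slot 1.
  $tag = ($ttype eq 'S' || $ttype eq 'E') ? lc $tag : '';
  # The raw source is the last element, except for T tokens,
  # whose last element is the is_data flag and whose text is in slot 1.
  my $rawtxt = $ttype eq 'T' ? $tok->[1] : $tok->[-1];
  $skip = 1 if $ttype eq 'S' && $tag eq 'script';
  $result .= $rawtxt if !$skip && $tag ne 'script';
  $skip = 0 if $ttype eq 'E' && $tag eq 'script';
}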

Re: Re: Removing Javascript

by boo_radley (Parson) on Jan 02, 2003 at 20:56 UTC

gets fooled by the first </script> tag.

document.write


Re: Re: Re: Removing Javascript

by Mur (Pilgrim) on Jan 02, 2003 at 21:00 UTC


Re: Removing Javascript
by dmitri (Priest) on Jan 02, 2003 at 20:56 UTC

<script language="Javascript">
echo "Always use </script>!";
</script>

<script language="Javascript">
echo "Always use <\/script>!";
</script>


Re: Removing Javascript
by jacques (Priest) on Jan 03, 2003 at 02:14 UTC

s/<(?:[^>'"]*|".*?"|'.*?')+script(?:[^<'"]*|".*?"|'.*?')+>.*?<(?:[^>'"]*|".*?"|'.*?')+\/.*?script(?:[^<'"]*|".*?"|'.*?')+>/<script><\/script>/igsx;

Quick 'n dirty. But Ovid's solution is more reliable. Also I totally agree with cLive.


Re: Removing Javascript
by domm (Chaplain) on Jan 04, 2003 at 13:00 UTC

HTML::Tree

Please note that this doesn't handle the unescaped closing script tag in document.write. I'd suggest running tidy on the input before passing it to the parser.

#!/usr/bin/perl -w
use strict;
use HTML::Tree;

my $doc = <<EOF;
<html>
<head>
<script language="Javascript">
document.write("Don't forget your &lt;/script&gt; tag! It's important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>
EOF

my $root=HTML::TreeBuilder->new();

$root->parse($doc);
$root->eof;

foreach my $n ($root->descendants) {
    next unless $n->tag;   # skip text nodes
    $n->delete if $n->tag eq 'script';
}


print $root->dump;    # prints structure
print $root->as_HTML; # prints as HTML

--
#!/usr/bin/perl
for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}
