in reply to Removing Javascript

HTML::Parser isn't the answer.
#!/usr/bin/perl -w
use strict;
use English;
use warnings;
use HTML::Parser;

{
    package JavascriptIsBad;
    use base 'HTML::Parser';

    my $result;
    my $skipping = 0;

    sub start {
        my($self, $tagname, $attr, $attrseq, $origtext) = @_;
        if (lc($tagname) eq 'script') {
            $skipping = 1;
        }
        $result .= $origtext unless $skipping;
    }

    sub end {
        my($self, $tagname, $origtext) = @_;
        $result .= $origtext unless $skipping;
        if (lc($tagname) eq 'script' and $skipping) {
            $skipping = 0;
        }
    }

    sub text {
        my($self, $origtext, $is_cdata) = @_;
        return if $skipping;
        $result .= $origtext;
        return;
    }

    sub result { $result }
}

my $p = JavascriptIsBad->new;
$p->parse(<<EOF);
<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>
EOF
print 'Result: ', $p->result, "\n";
gets fooled by the first </script> tag.
--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
vox 269.226.9550 ext 24
fax 269.349.9076
 http://www.nexcerpt.com
...Nexcerpt...Connecting People With Expertise

Re: Re: Removing Javascript
by Mur (Pilgrim) on Jan 02, 2003 at 20:57 UTC
    Ah! But TokeParser gets me where I want to go.
    #!/usr/bin/perl -w
    use strict;
    use English;
    use warnings;
    use HTML::TokeParser;

    my $doc = <<EOF;
    <html>
    <head>
    <script language="Javascript">
    document.write("Don't forget your </script> tag! It's important!");
    </script>
    </head>
    <body>
    This is just some text.
    </body>
    </html>
    EOF

    my $p = HTML::TokeParser->new(\$doc);
    my $result;
    my $skipping = 0;
    while (my $tok = $p->get_token) {
        if ($tok->[0] eq 'S') {
            if (lc($tok->[1]) eq 'script') {
                $skipping = 1;
            }
            elsif (!$skipping) {
                $result .= $tok->[-1];
                $result .= $p->get_text;
            }
        }
        elsif ($tok->[0] eq 'E') {
            if (lc($tok->[1]) eq 'script') {
                $skipping = 0;
            }
            elsif (!$skipping) {
                $result .= $tok->[-1];
                $result .= $p->get_text;
            }
        }
        elsif (!$skipping) {
            $result .= $tok->[-1];
            $result .= $p->get_text;
        }
    }
    print 'Result: ', $result, "\n";

      I see. So you delete everything from the <script> tag (inclusive) up to the first non-<script> tag (exclusive) that follows a </script> tag. Clever. But I really don't know whether this helps or not.

      It will strip any text that might follow the </script> (which may not matter if they only have <script> in the <head>), but cases like these probably do matter:

      <html>
      <head>
      <script language="Javascript">
      document.write("Don't forget your </script> tag! It's important!");
      document.write("Even the <body> tag is important!");
      </script>
      </head>
      <body>
      This is just some text.
      </body>
      </html>
      or
      <html>
      <head>
      <script language="Javascript">
      document.write("Don't forget your </script> tag! It's important!");
      if (x<y) { alert("y > x") }
      </script>
      </head>
      <body>
      This is just some text.
      </body>
      </html>

      You'd have to parse the JavaScript (at least to some extent) to be able to say whether the </script> is meant to close it or not.

      Actually, I guess you'd only have to distinguish three states inside the JavaScript: "inside a single-quoted string", "inside a double-quoted string" and "elsewhere". And you'd only treat the </script> as the closing tag in the "elsewhere" state.
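
A minimal sketch of that three-state idea, for illustration only. The helper name strip_scripts is hypothetical and not part of any module. It assumes backslash escapes are the only way to embed a quote in a string, and it does not handle JavaScript comments, regex literals, or template literals, any of which could still confuse it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_scripts() is a hypothetical helper: it removes <script> elements,
# honouring </script> only when the scanner is outside a JavaScript string.
sub strip_scripts {
    my ($html) = @_;
    my $out = '';

    # Find each opening <script ...> tag, keeping the text before it.
    while ($html =~ /\G(.*?)<script\b[^>]*>/gcis) {
        $out .= $1;
        my $state = 'elsewhere';    # or the quote character we are inside

        # Walk the script body one token at a time:
        # a closing tag, a backslash escape, a quote, or any other character.
        while ($html =~ /\G(?:(<\/script\s*>)|(\\.)|(["'])|.)/gcis) {
            if (defined $1) {
                last if $state eq 'elsewhere';    # genuine closing tag
            }
            elsif (defined $3) {
                if    ($state eq 'elsewhere') { $state = $3 }           # string opens
                elsif ($state eq $3)          { $state = 'elsewhere' }  # string closes
            }
            # Escapes and ordinary characters never change the state.
        }
    }

    # Append whatever follows the last script element.
    $html =~ /\G(.*)/gcs;
    $out .= $1;
    return $out;
}

my $doc = <<'EOF';
<html> <head> <script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
if (x<y) { alert("y > x") }
</script> </head> <body> This is just some text. </body> </html>
EOF

print 'Result: ', strip_scripts($doc);
```

On the sample above, the </script> inside the double-quoted string is skipped and only the final one ends the script element, so the text after it is kept.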

      Jenda

      Just a small change... I like it better like this:
      my $result;
      my $skip = 0;
      while (my $tok = $p->get_token) {
          my ($ttype, $tag) = @{ $tok };
          # The raw text is the last element of S, E, C and D tokens, but its
          # index differs by token type, so take it from the end of the array.
          my $rawtxt = $tok->[-1];
          $tag = lc $tag;
          $skip = 1 if (($ttype eq 'S') && ($tag eq 'script'));
          if ((!$skip) && ($tag ne 'script')) {
              $result .= $rawtxt;
              $result .= $p->get_text;
          }
          $skip = 0 if (($ttype eq 'E') && ($tag eq 'script'));
      }
Re: Re: Removing Javascript
by boo_radley (Parson) on Jan 02, 2003 at 20:56 UTC
    gets fooled by the first </script> tag.
    Something's getting fooled, but it's not HTML::Parser.
    You should be using the appropriate HTML entities in your JavaScript (&lt; and &gt;) instead of using the actual closing tag in your document.write statement.
      That would be nice if I could force it on all the web page authors out there. But I can't. I'm parsing "real-world" HTML, and apparently browsers don't flag this as bad. 8-(