in reply to Re: Removing Javascript
in thread Removing Javascript

Ah! But TokeParser gets me where I want to go.
#!/usr/bin/perl -w
use strict;
use English;
use warnings;
use HTML::TokeParser;

my $doc = <<EOF;
<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>
EOF

my $p = HTML::TokeParser->new(\$doc);
my $result;
my $skipping = 0;
while (my $tok = $p->get_token) {
    if ($tok->[0] eq 'S') {
        if (lc($tok->[1]) eq 'script') {
            $skipping = 1;
        }
        elsif (!$skipping) {
            $result .= $tok->[-1];
            $result .= $p->get_text;
        }
    }
    elsif ($tok->[0] eq 'E') {
        if (lc($tok->[1]) eq 'script') {
            $skipping = 0;
        }
        elsif (!$skipping) {
            $result .= $tok->[-1];
            $result .= $p->get_text;
        }
    }
    elsif (!$skipping) {
        $result .= $tok->[-1];
        $result .= $p->get_text;
    }
}
print 'Result: ', $result, "\n";
--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
vox 269.226.9550 ext 24
fax 269.349.9076
 http://www.nexcerpt.com
...Nexcerpt...Connecting People With Expertise

Re: Re: Re: Removing Javascript
by Jenda (Abbot) on Jan 02, 2003 at 23:44 UTC

    I see. So you delete everything from the <script> tag (inclusive) up to the first non-<script> tag (exclusive) that follows a </script> tag. Clever. But I really don't know whether this helps.

    It will strip any text that follows the </script> (which may not matter if they only have <script> in the <head>), but cases like these probably do matter:

    <html>
    <head>
    <script language="Javascript">
    document.write("Don't forget your </script> tag! It's important!");
    document.write("Even the <body> tag is important!");
    </script>
    </head>
    <body>
    This is just some text.
    </body>
    </html>
    or
    <html>
    <head>
    <script language="Javascript">
    document.write("Don't forget your </script> tag! It's important!");
    if (x<y) { alert("y > x") }
    </script>
    </head>
    <body>
    This is just some text.
    </body>
    </html>

    You'd have to parse the JavaScript (at least to some extent) to be able to say whether the </script> is meant to close it or not.

    Actually, I guess you'd only have to distinguish three states inside the JavaScript: "inside a single-quoted string", "inside a double-quoted string" and "elsewhere". And you'd only treat the </script> as the closing tag in the "elsewhere" state.
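
    The three-state idea above can be sketched in plain Perl, without any parser module. This is only an illustration under stated assumptions: strip_scripts is a hypothetical helper name, backslash escapes inside strings are skipped (an addition not mentioned above), and JavaScript comments and regex literals are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove <script>...</script>
# elements, treating "</script>" as the closing tag only when the scanner
# is outside a single- or double-quoted JavaScript string.
sub strip_scripts {
    my ($html) = @_;
    my $out = '';
    pos($html) = 0;

    while (1) {
        # Copy everything up to the next <script ...> start tag.
        if ($html =~ /\G(.*?)<script\b[^>]*>/gcsi) {
            $out .= $1;
        }
        else {
            $out .= substr($html, pos($html));   # no more scripts: keep the rest
            last;
        }

        # Inside the script: track the three states.
        my $state = 'elsewhere';
        while (1) {
            if ($state eq 'elsewhere') {
                if    ($html =~ /\G<\/script\s*>/gci) { last }   # real closing tag
                elsif ($html =~ /\G(['"])/gc) {
                    $state = $1 eq "'" ? 'single' : 'double';
                }
                elsif ($html =~ /\G./gcs) { }    # ordinary character, skip it
                else                      { last }   # unterminated script
            }
            else {
                my $quote = $state eq 'single' ? "'" : '"';
                if    ($html =~ /\G\\./gcs) { }  # assumption: skip escaped char
                elsif ($html =~ /\G(.)/gcs) {
                    $state = 'elsewhere' if $1 eq $quote;
                }
                else { last }                    # unterminated string
            }
        }
    }
    return $out;
}

# Usage: the quoted </script> no longer ends the script early.
my $doc = q{<head><script language="Javascript">document.write("Don't forget your </script> tag!");</script></head><body>This is just some text.</body>};
print strip_scripts($doc), "\n";
```

    The scanner walks the script one character at a time using \G with the /gc modifiers, so a failed match leaves the position where it was. It is slow but easy to follow, and an unterminated script simply drops everything after the opening tag.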

    Jenda

Re: Re: Re: Removing Javascript
by osama (Scribe) on Jan 03, 2003 at 20:52 UTC
    Just a small change... I like it better like this:
    my $result;
    my $skip = 0;
    while (my $tok = $p->get_token) {
        # Only 'S' tokens have five elements; for both 'S' and 'E' tokens
        # the raw tag text is the last element, so read it from the end.
        my ($ttype, $tag) = @{$tok};
        my $rawtxt = $tok->[-1];
        $tag = lc $tag;
        $skip = 1 if $ttype eq 'S' && $tag eq 'script';
        if (!$skip && $tag ne 'script') {
            $result .= $rawtxt;
            $result .= $p->get_text;
        }
        $skip = 0 if $ttype eq 'E' && $tag eq 'script';
    }