Mur has asked for the wisdom of the Perl Monks concerning the following question:

Does anyone have a quick'n'dirty (yet oh so reliable!) way to remove everything between <script> and </script> tags? Of course, it should match minimally; that is, it should not fail here:
<script language="Javascript"> foo bar foo bar </script>
You fail if you remove this line!
<script language="Javascript"> bar foo bar foo </script>
And it should handle embedded things that aren't really close tags:
<script language="Javascript"> echo "Always use </script>!"; </script>
--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
vox 269.226.9550 ext 24
fax 269.349.9076
 http://www.nexcerpt.com
...Nexcerpt...Connecting People With Expertise

Replies are listed 'Best First'.
Re: Removing Javascript
by Ovid (Cardinal) on Jan 02, 2003 at 21:03 UTC

    HTML::TokeParser::Simple to the rescue. You said you wanted to remove everything "between" the tags, so I'm leaving the tags in. This should be relatively easy to fix if you want to also strip the script tags.

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser::Simple 1.4;

    my $parser = HTML::TokeParser::Simple->new( *DATA );
    my $html = '';
    my $is_script = 0;
    while ( my $token = $parser->get_token ) {
        # copy every token through unless we are inside a script block
        $html .= $token->as_is unless $is_script;
        if ( $token->is_start_tag('script') ) {
            $is_script = 1;
        }
        elsif ( $token->is_end_tag('script') ) {
            $is_script = 0;
            $html .= $token->as_is;    # keep the closing tag itself
        }
    }
    print $html;
    __DATA__
    <title>foobar</title>
    <script language="Javascript">
    foo bar foo bar
    </script>
    You fail if you remove this line!
    <script language="Javascript">
    bar foo bar foo
    </script>

    Cheers,
    Ovid

    New address of my CGI Course.
    Silence is Evil (feel free to copy and distribute widely - note copyright text)

Re: Removing Javascript
by cLive ;-) (Prior) on Jan 03, 2003 at 01:02 UTC
    Just a quickie here. If your intention is to strip out all javascript, then you have your work cut out. There's not just the script tags - there's the onMouseovers, onLoad and all the other event handlers.

    Without knowing the context in which you want to strip (security?), it's hard to suggest the best method.

    What I would do, if possible, is work the other way round. Define a list of HTML tags/attributes that are valid and strip out everything else.

    .02

    cLive ;-)

      You are right. HTML::TagFilter will help with this.

      Jenda

      P.S.: I have something similar here. I planned to release it as HTML::TagFilter when I'm satisfied with the code, but William was quicker :-)

Re: Removing Javascript
by joe++ (Friar) on Jan 02, 2003 at 20:22 UTC
    Hi Mur (Jeff?),

    The problem here is - as always - the behaviour of some web browsers, that actually render the most crappy html/javascript code when they really shouldn't.

    BTW, I'm interpreting your second code example as being Javascript rather than PHP (which it looks more like), so it would read like this:

    <script language="Javascript"> document.write("Always use </script>!"); </script>
    Anyway, in order to cope with this kind of crap I normally try a combination of HTML Tidy and/or LibXML's xmllint with the proper flags to accept and correct malformed html as input.

    Then, if your source can be corrected this way and converted into well-formed XHTML, you can even think of applying a very minimal XSLT stylesheet which leaves out all the <script/> elements.

    (Yes, I know, this has nothing to do with Perl, but this is how I have solved this kind of problem many times already.)

    --
    Cheers, Joe

Re: Removing Javascript
by Ionizor (Pilgrim) on Jan 02, 2003 at 20:25 UTC

    It sounds like you're looking for a regex. 99% of the time slicing up HTML with regexes is the wrong thing to do. I would try looking into HTML::TokeParser or similar modules to do this.

    If you're using XHTML you can use an XML parser such as XML::Simple or XML::Parser but that may be overkill.

Re: Removing Javascript
by chromatic (Archbishop) on Jan 02, 2003 at 20:27 UTC
Re: Removing Javascript
by Mur (Pilgrim) on Jan 02, 2003 at 20:36 UTC
    HTML::Parser isn't the answer.
    #!/usr/bin/perl -w
    use strict;
    use English;
    use warnings;
    use HTML::Parser;

    {
        package JavascriptIsBad;
        use base 'HTML::Parser';
        my $result;
        my $skipping = 0;

        sub start {
            my($self, $tagname, $attr, $attrseq, $origtext) = @_;
            if (lc($tagname) eq 'script') {
                $skipping = 1;
            }
            $result .= $origtext unless $skipping;
        }

        sub end {
            my($self, $tagname, $origtext) = @_;
            $result .= $origtext unless $skipping;
            if (lc($tagname) eq 'script' and $skipping) {
                $skipping = 0;
            }
        }

        sub text {
            my($self, $origtext, $is_cdata) = @_;
            return if $skipping;
            $result .= $origtext;
            return;
        }

        sub result { $result }
    }

    my $p = JavascriptIsBad->new;
    $p->parse(<<EOF);
    <html>
    <head>
    <script language="Javascript">
    document.write("Don't forget your </script> tag! It's important!");
    </script>
    </head>
    <body>
    This is just some text.
    </body>
    </html>
    EOF
    print 'Result: ', $p->result, "\n";
    gets fooled by the first </script> tag.
      Ah! But TokeParser gets me where I want to go.
      #!/usr/bin/perl -w
      use strict;
      use English;
      use warnings;
      use HTML::TokeParser;

      my $doc = <<EOF;
      <html>
      <head>
      <script language="Javascript">
      document.write("Don't forget your </script> tag! It's important!");
      </script>
      </head>
      <body>
      This is just some text.
      </body>
      </html>
      EOF

      my $p = HTML::TokeParser->new(\$doc);
      my $result;
      my $skipping = 0;
      while (my $tok = $p->get_token) {
          if ($tok->[0] eq 'S') {
              if (lc($tok->[1]) eq 'script') {
                  $skipping = 1;
              }
              elsif (!$skipping) {
                  $result .= $tok->[-1];
                  $result .= $p->get_text;
              }
          }
          elsif ($tok->[0] eq 'E') {
              if (lc($tok->[1]) eq 'script') {
                  $skipping = 0;
              }
              elsif (!$skipping) {
                  $result .= $tok->[-1];
                  $result .= $p->get_text;
              }
          }
          elsif (!$skipping) {
              $result .= $tok->[-1];
              $result .= $p->get_text;
          }
      }
      print 'Result: ', $result, "\n";

        I see. So you delete everything from the <script> (inclusive) up to the first non-<script> tag (exclusive) that follows a </script> tag. Clever. But whether this helps or not I really don't know.

        It will strip any text that might follow the </script> (which may not matter if they only have <script> in the <head>), but these probably do matter:

        <html>
        <head>
        <script language="Javascript">
        document.write("Don't forget your </script> tag! It's important!");
        document.write("Even the <body> tag is important!");
        </script>
        </head>
        <body>
        This is just some text.
        </body>
        </html>
        or
        <html>
        <head>
        <script language="Javascript">
        document.write("Don't forget your </script> tag! It's important!");
        if (x<y) { alert("y > x") }
        </script>
        </head>
        <body>
        This is just some text.
        </body>
        </html>

        You'd have to parse the JavaScript, at least to some extent, to be able to say whether the </script> is meant to close it or not.

        Actually I guess you'd only have to distinguish three states inside the JavaScript: "Inside a single-quoted string", "Inside a double-quoted string" and "Elsewhere". And you'd only treat </script> as the closing tag in the "Elsewhere" state.
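
        A minimal sketch of that three-state idea, using only core Perl. The name strip_scripts is made up for illustration and does not come from any module. It assumes that quotes inside JavaScript only ever delimit strings, so it knows nothing about JavaScript comments or regex literals (an apostrophe in a // comment would confuse it), and it assumes the opening tag has no ">" inside an attribute value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): drop every <script> block,
# treating "</script>" as the end of the block only when the scanner is
# outside a single- or double-quoted JavaScript string.
sub strip_scripts {
    my ($html) = @_;
    my $out = '';
    pos($html) = 0;

    # Find the next opening tag, keeping whatever precedes it.
    while ( $html =~ /\G(.*?)<script\b[^>]*>/gcis ) {
        $out .= $1;
        my $quote = '';    # '' is the "Elsewhere" state, otherwise ' or "
        while ( pos($html) < length $html ) {
            if ( $quote ne '' ) {
                if    ( $html =~ /\G\\./gcs )       { }              # escaped character
                elsif ( $html =~ /\G\Q$quote\E/gc ) { $quote = '' }  # string ends
                else                                { pos($html)++ } # string body
            }
            elsif ( $html =~ m{\G</script\s*>}gci ) { last }         # real closing tag
            elsif ( $html =~ /\G(["'])/gc )         { $quote = $1 }  # string starts
            else                                    { pos($html)++ } # other code
        }
    }
    $out .= substr( $html, pos($html) );    # everything after the last script
    return $out;
}

my $doc = <<'EOF';
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
if (x<y) { alert("y > x") }
</script>
</head>
<body> This is just some text. </body>
EOF

print strip_scripts($doc);
```

        Unlike the TokeParser loop above, this removes the script tags themselves and keeps any text that follows the real closing tag. It is a scanner, not an HTML parser, so it is only as reliable as the assumptions listed above.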

        Jenda

        Just a small change... I like it better like this:
        my $result;
        my $skip = 0;
        while (my $tok = $p->get_token) {
            my($ttype, $tag, $attr, $attrseq, $rawtxt) = @{ $tok };
            $tag = lc $tag;
            $skip = 1 if (($ttype eq 'S') && ($tag eq 'script'));
            if ((!$skip) && ($tag ne 'script')) {
                $result .= $rawtxt;
                $result .= $p->get_text;
            }
            $skip = 0 if (($ttype eq 'E') && ($tag eq 'script'));
        }
      gets fooled by the first </script> tag.
      Something's getting fooled, but it's not HTML::Parser.
      You should be using the appropriate HTML entities in your javascript ( &lt; and &gt;) instead of using the actual closing tag in your document.write statement.
        That would be nice if I could force it on all the web page authors out there. But I can't. I'm parsing "real-world" HTML, and apparently browsers don't flag this as bad. 8-(
Re: Removing Javascript
by dmitri (Priest) on Jan 02, 2003 at 20:56 UTC
    You should not try to do more than a browser would. No browser will parse
    <script language="Javascript"> echo "Always use </script>!"; </script>
    correctly. This is how this is usually written:
    <script language="Javascript"> echo "Always use <\/script>!"; </script>
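
    If you generate such pages yourself, the escaping can be automated. A small sketch in core Perl follows; the name escape_close_tags is made up for illustration, and it assumes every "</" in the input sits inside a JavaScript string literal, where "\/" means the same as "/".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: rewrite "</" as "<\/" so that an HTML parser no
# longer sees a closing tag inside the inline script.
# Assumption: every "</" in the input is inside a JavaScript string literal.
sub escape_close_tags {
    my ($js) = @_;
    $js =~ s{</}{<\\/}g;
    return $js;
}

print escape_close_tags(q{document.write("Always use </script>!");}), "\n";
# prints: document.write("Always use <\/script>!");
```

    This only helps for pages you control. It does nothing for real-world HTML that already contains the unescaped form.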
Re: Removing Javascript
by jacques (Priest) on Jan 03, 2003 at 02:14 UTC
    s/<(?:[^>'"]*|".*?"|'.*?')+script(?:[^<'"]*|".*?"|'.*?')+>.*?<(?:[^>'"]*|".*?"|'.*?')+\/.*?script(?:[^<'"]*|".*?"|'.*?')+>/<script><\/script>/igsx;

    Quick 'n dirty. But Ovid's solution is more reliable. Also I totally agree with cLive.

Re: Removing Javascript
by domm (Chaplain) on Jan 04, 2003 at 13:00 UTC
    Yet another way, using HTML::Tree:

    Please note that this doesn't handle the unescaped closing script tag in document.write. I'd suggest running tidy on the input before passing it to the parser.

    #!/usr/bin/perl -w
    use strict;
    use HTML::Tree;

    my $doc = <<EOF;
    <html>
    <head>
    <script language="Javascript">
    document.write("Don't forget your &lt;/script&gt; tag! It's important!");
    </script>
    </head>
    <body>
    This is just some text.
    </body>
    </html>
    EOF

    my $root = HTML::TreeBuilder->new();
    $root->parse($doc);
    $root->eof;

    foreach my $n ($root->descendants) {
        next unless $n->tag;    # skip text nodes
        $n->delete if $n->tag eq 'script';
    }

    print $root->dump;       # prints structure
    print $root->as_HTML;    # prints as HTML
    -- #!/usr/bin/perl for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}