Re: Removing Javascript
by Ovid (Cardinal) on Jan 02, 2003 at 21:03 UTC
HTML::TokeParser::Simple to the rescue. You said you wanted to remove everything "between" the tags, so I'm leaving the tags in. This should be relatively easy to fix if you want to also strip the script tags.
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser::Simple 1.4;
my $parser = HTML::TokeParser::Simple->new( *DATA );
my $html = '';
my $is_script = 0;
while ( my $token = $parser->get_token ) {
    $html .= $token->as_is unless $is_script;
    if ( $token->is_start_tag('script') ) {
        $is_script = 1;
    }
    elsif ( $token->is_end_tag('script') ) {
        $is_script = 0;
        $html .= $token->as_is;
    }
}
print $html;
__DATA__
<title>foobar</title>
<script language="Javascript">
foo bar foo bar
</script>
You fail if you remove this line!
<script language="Javascript">
bar foo bar foo
</script>
Cheers,
Ovid
New address of my CGI Course.
Silence is Evil (feel free to copy and distribute widely - note copyright text)
Re: Removing Javascript
by cLive ;-) (Prior) on Jan 03, 2003 at 01:02 UTC
Just a quickie here. If your intention is to strip out all javascript, then you have your work cut out. There's not just the script tags - there's the onMouseovers, onLoad and all the other event handlers.
Without knowing the context in which you want to strip - security? - it's hard to suggest best method.
What I would do, if possible, is work the other way round. Define a list of HTML tags/attributes that are valid and strip out everything else.
.02
cLive ;-)
|
|
You are right. HTML::TagFilter will help with this.
Jenda
P.S.: I have something similar here. I planned to release it as HTML::TagFilter when I'm satisfied with the code, but William was quicker :-)
Re: Removing Javascript
by joe++ (Friar) on Jan 02, 2003 at 20:22 UTC
Hi Mur (Jeff?),
The problem here is - as always - the behaviour of some web browsers that happily render the most crappy html/javascript code when they really shouldn't.
BTW, I'm interpreting your second code example as being Javascript rather than PHP (which it looks more like), so it would read like this:
<script language="Javascript">
document.write("Always use </script>!");
</script>
Anyway, in order to cope with this kind of crap I normally try a combination of HTML Tidy and/or libxml2's xmllint with the proper flags to accept and correct malformed html as input.
Then, if your source can be corrected this way and converted into well-formed xhtml, you can even think of applying a very minimal XSLT stylesheet which leaves out all the <script/> elements.
(yes, I know, this has nothing to do with Perl, but this is how I have solved this kind of problem many times already).
--
Cheers, Joe
Re: Removing Javascript
by Ionizor (Pilgrim) on Jan 02, 2003 at 20:25 UTC
It sounds like you're looking for a regex. 99% of the time slicing up HTML with regexes is the wrong thing to do. I would try looking into HTML::TokeParser or similar modules to do this.
If you're using XHTML you can use an XML parser such as XML::Simple or XML::Parser but that may be overkill.
Re: Removing Javascript
by chromatic (Archbishop) on Jan 02, 2003 at 20:27 UTC
Re: Removing Javascript
by Mur (Pilgrim) on Jan 02, 2003 at 20:36 UTC
HTML::Parser isn't the answer.
#!/usr/bin/perl -w
use strict;
use English;
use warnings;
use HTML::Parser;
{
    package JavascriptIsBad;
    use base 'HTML::Parser';
    my $result;
    my $skipping = 0;
    sub start {
        my($self, $tagname, $attr, $attrseq, $origtext) = @_;
        if (lc($tagname) eq 'script') {
            $skipping = 1;
        }
        $result .= $origtext unless $skipping;
    }
    sub end {
        my($self, $tagname, $origtext) = @_;
        $result .= $origtext unless $skipping;
        if (lc($tagname) eq 'script' and $skipping) {
            $skipping = 0;
        }
    }
    sub text {
        my($self, $origtext, $is_cdata) = @_;
        return if $skipping;
        $result .= $origtext;
        return;
    }
    sub result { $result }
}
my $p = JavascriptIsBad->new;
$p->parse(<<EOF);
<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>
EOF
print 'Result: ', $p->result, "\n";
gets fooled by the first </script> tag.
--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
...Nexcerpt...Connecting People With Expertise
|
|
Ah! But TokeParser gets me where I want to go.
#!/usr/bin/perl -w
use strict;
use English;
use warnings;
use HTML::TokeParser;
my $doc = <<EOF;
<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>
EOF
my $p = HTML::TokeParser->new(\$doc);
my $result;
my $skipping = 0;
while (my $tok = $p->get_token) {
    if ($tok->[0] eq 'S') {
        if (lc($tok->[1]) eq 'script') {
            $skipping = 1;
        } elsif (!$skipping) {
            $result .= $tok->[-1];
            $result .= $p->get_text;
        }
    } elsif ($tok->[0] eq 'E') {
        if (lc($tok->[1]) eq 'script') {
            $skipping = 0;
        } elsif (!$skipping) {
            $result .= $tok->[-1];
            $result .= $p->get_text;
        }
    } elsif (!$skipping) {
        $result .= $tok->[-1];
        $result .= $p->get_text;
    }
}
print 'Result: ', $result, "\n";
--
Jeff Boes
|
|
I see. So you delete everything from the <script>(inclusive) up to the first non<script> tag(exclusive) that follows a </script> tag. Clever. But whether this helps or not I really don't know.
It will strip any text that might follow the </script> (which may not matter if they only have <script> in the <head>), but cases like these probably do matter:
<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
document.write("Even the <body> tag is important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>
or
<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
if (x<y) { alert("y > x") }
</script>
</head>
<body>
This is just some text.
</body>
</html>
You'd have to parse the JavaScript (at least to some extent) to be able to say whether the </script> is meant to close it or not.
Actually I guess you'd only have to distinguish three states inside the JavaScript: "Inside a singlequoted string", "Inside a doublequoted string" and "Elsewhere".
And you'd only treat the </script> as the closing tag in the "Elsewhere" state.
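The three states could be sketched roughly like this, in plain Perl with no modules. Treat it as a sketch under assumptions, not a finished filter: the sub name strip_scripts is made up for illustration, it assumes backslash escapes inside strings, and it knows nothing about JavaScript comments or regex literals, so a quote inside a comment will still confuse it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): drop <script>...</script>
# blocks, ignoring any "</script>" that sits inside a quoted JavaScript string.
sub strip_scripts {
    my ($html) = @_;
    my $out = '';
    pos($html) = 0;
    while (1) {
        # Copy everything up to the next opening <script ...> tag.
        if ($html =~ /\G(.*?)<script\b[^>]*>/gcis) {
            $out .= $1;
        }
        else {
            # No more script tags: copy the rest and stop.
            $html =~ /\G(.*)/gcs;
            $out .= $1;
            last;
        }
        # Scan the script body. $state is '' for "Elsewhere",
        # or the quote character of the string we are inside.
        my $state = '';
        while ($html =~ /\G(?:(\\.)|(<\/script\s*>)|(['"])|.)/gcis) {
            next if defined $1;                  # backslash escape: skip both chars
            if (defined $2) {
                last if $state eq '';            # a real closing tag
            }
            elsif (defined $3) {
                if    ($state eq '') { $state = $3 }   # a string opens
                elsif ($state eq $3) { $state = '' }   # the same quote closes it
            }
        }
    }
    return $out;
}

print strip_scripts(q{a<script>x = "</script>";</script>b}), "\n";   # prints "ab"
```

Note that this removes the script tags themselves as well as their content, and that an unterminated string in the JavaScript will swallow the rest of the document.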
Jenda
|
|
Just a small change... I like it better like this:
my $result = '';
my $skip = 0;
while (my $tok = $p->get_token) {
    # Only 'S' tokens have five fields. The original text is the last
    # field of both 'S' and 'E' tokens, so read it from the end of the array.
    my ($ttype, $tag) = @{ $tok };
    my $rawtxt = $tok->[-1];
    $tag = lc $tag;
    $skip = 1 if $ttype eq 'S' && $tag eq 'script';
    if (!$skip && $tag ne 'script') {
        $result .= $rawtxt;
        $result .= $p->get_text;
    }
    $skip = 0 if $ttype eq 'E' && $tag eq 'script';
}
|
|
gets fooled by the first </script> tag.
Something's getting fooled, but it's not HTML::Parser.
You should be using the appropriate HTML entities in your javascript (&lt; and &gt;) instead of using the actual closing tag in your document.write statement.
|
|
That would be nice if I could force it on all the web page authors out there. But I can't. I'm parsing "real-world" HTML, and apparently browsers don't flag this as bad. 8-(
--
Jeff Boes
Re: Removing Javascript
by dmitri (Priest) on Jan 02, 2003 at 20:56 UTC
You should not try to do more than a browser would. No browser will parse
<script language="Javascript">
echo "Always use </script>!";
</script>
correctly. This is how this is usually written:
<script language="Javascript">
echo "Always use <\/script>!";
</script>
Re: Removing Javascript
by jacques (Priest) on Jan 03, 2003 at 02:14 UTC
s/<(?:[^>'"]*|".*?"|'.*?')+script(?:[^<'"]*|".*?"|'.*?')+>.*?<(?:[^>'"]*|".*?"|'.*?')+\/.*?script(?:[^<'"]*|".*?"|'.*?')+>/<script><\/script>/igsx;
Quick 'n dirty. But Ovid's solution is more reliable. Also I totally agree with cLive.
Re: Removing Javascript
by domm (Chaplain) on Jan 04, 2003 at 13:00 UTC
Yet another way, using HTML::Tree:
Please note that this doesn't handle the unescaped closing script tag in document.write. I'd suggest running tidy on the input before passing it to the parser.
#!/usr/bin/perl -w
use strict;
use HTML::Tree;
my $doc = <<EOF;
<html>
<head>
<script language="Javascript">
document.write("Don't forget your </script> tag! It's important!");
</script>
</head>
<body>
This is just some text.
</body>
</html>
EOF
my $root=HTML::TreeBuilder->new();
$root->parse($doc);
$root->eof;
foreach my $n ($root->descendants) {
    next unless $n->tag; # skip text nodes
    $n->delete if $n->tag eq 'script';
}
print $root->dump; # prints structure
print $root->as_HTML; # prints as HTML
--
#!/usr/bin/perl
for(ref bless{},just'another'perl'hacker){s-:+-$"-g&&print$_.$/}