in reply to JavaScript allowed in posts!

You seem to be testing the Perlmonks engine to its limits :) but you are right, not stripping the <SCRIPT> tags from HTML is bad, as I don't see any valid reason why people would want to see JavaScript in the posts here.

I think the following code should strip all <SCRIPT> tags :

my $post; $post =~ s!<SCRIPT(?:\s[^>]*)?>.*?</SCRIPT(?:\s[^>]*)?>!!gis;
My reason for the parentheses is that only a person with bad intent would want to use JavaScript in nodes anyway and could maybe try to trick the code into not stripping the script by adding some attributes to the closing part of the script tags.
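
Here is a minimal, runnable sketch of that substitution, wrapped in a helper so it can be tried on a sample string. The name `strip_script_tags` and the sample input are made up for illustration and are not part of the Perlmonks engine. Note that the closing group needs `[^>]*` (with the `*`), otherwise a closing tag with more than one character of attributes slips through.

```perl
use strict;
use warnings;

# Hypothetical helper name, for illustration only.
sub strip_script_tags {
    my ($post) = @_;
    # /g: every occurrence, /i: any letter case, /s: let . match newlines.
    # (?:\s[^>]*)? permits optional attributes in the opening tag
    # and, defensively, in the closing tag as well.
    $post =~ s!<SCRIPT(?:\s[^>]*)?>.*?</SCRIPT(?:\s[^>]*)?>!!gis;
    return $post;
}

# Sample input with an attribute smuggled into the closing tag.
my $in = qq{Hi <script language="JavaScript">alert(1)</script foo="bar"> there};
print strip_script_tags($in), "\n";
```

This only removes complete pairs; a <SCRIPT> tag that is never closed is left in place, so it is a sketch and not a complete filter.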

On the other hand, I think <FONT> tags and similar things (like color etc.) should also be avoided. Maybe it would be better to have a positive list of allowed tags instead of allowing everything and then banning some special tags...

My list of "good" tags would be more or less the following :

Section            Allowed tags
Font manipulation  B, I, U, TT, CODE, H1, H2, H3, H4, H5, H6
Layout             TABLE, THEAD, TBODY, TR, TH, TD, CENTER, P, DIV, UL, OL, LI
Links              A
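
A positive list like this could be applied roughly as follows. This is only a sketch under my own assumptions: `filter_tags` is a hypothetical helper, tags are recognized with a simple regex and not a real HTML parser, and attributes on allowed tags are passed through unchecked.

```perl
use strict;
use warnings;

# Allowed tag names, taken from the list above.
my %allowed = map { $_ => 1 } qw(
    B I U TT CODE H1 H2 H3 H4 H5 H6
    TABLE THEAD TBODY TR TH TD CENTER P DIV UL OL LI
    A
);

# Hypothetical helper: drop every start or end tag whose name is not
# on the list. The text between the tags is left untouched.
sub filter_tags {
    my ($post) = @_;
    $post =~ s{(</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>)}{ $allowed{ uc($2) } ? $1 : '' }ge;
    return $post;
}

print filter_tags('<B>bold</B> <FONT color="red">red</FONT>'), "\n";
```

Because only the tags are removed, the body of a <SCRIPT> element would remain as visible text, so the script-stripping substitution would still have to run first.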

Also, the engine could maybe even check for ill-formed HTML, that is, unclosed tags. I hate it if somebody posts with <PRE> and then does not close the tag so that all subsequent text is rendered as preformatted in Courier New. But that one requires much more analysis I think - or maybe not. An idea off the top of my head :

  1. Have a list of paired tags
  2. For each start tag of a pair, search for a closing tag, and remove both.
  3. If no closing tag was found, append the matching closing tag (regardless of scope) to the submitted text. This will mess up the layout, but the person should have submitted well-formed HTML in the first place.
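
The three steps could be sketched like this. The list of paired tags and the helper name `close_unclosed_tags` are assumptions of mine, and the pair removal is done on a scratch copy so the submitted text itself is only ever appended to.

```perl
use strict;
use warnings;

# Step 1: an assumed list of paired tags (not an official list).
my @paired = qw(B I U TT CODE PRE P DIV UL OL LI TABLE TR TD TH A);

# Hypothetical helper: returns the post with a closing tag appended
# for every start tag that has no later closing tag.
sub close_unclosed_tags {
    my ($post) = @_;
    my $work   = $post;   # scratch copy; matched pairs are removed from it
    my $append = '';
    for my $tag (@paired) {
        # Step 2: remove a start tag together with the next closing tag,
        # repeating until no complete pair is left.
        1 while $work =~ s{<$tag\b[^>]*>(.*?)</$tag\s*>}{$1}is;
        # Step 3: any start tag still present has no closing tag,
        # so queue one closing tag for each of them.
        my $open = () = $work =~ /<$tag\b[^>]*>/gi;
        $append .= "</$tag>" x $open;
    }
    return $post . $append;
}

print close_unclosed_tags('<PRE>some code'), "\n";
```

Each tag is handled on its own, so nesting is ignored, exactly the "regardless of scope" crudeness described in step 3. Stray closing tags without a start tag are not touched.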

This method is crude and may do more harm than good - maybe instead of fixing the HTML, the engine should simply return a warning like

Your post contains what we consider bad HTML, please fix it.

Update: vroom has a post about his position on HTML online now.

Allowed/forbidden HTML
by turnstep (Parson) on Jun 04, 2000 at 18:30 UTC

    Great idea about limiting the HTML

    You seem to have forgotten the <STRONG> and <EM> tags. Those of us who have been doing HTML for a while know that these are more correct than using <B> and <I>. <CENTER> should not be included as it has been deprecated.

    In addition, we should consider:

    • BLOCKQUOTE
    • PRE
    • DL, DT, and DD
    • HR (?? - touchy call)
    • IMG (subject to abuse - perhaps only higher level monks?)
    • SUB and SUP
    • TFOOT (since you already have THEAD)
    • VAR
    • Possibly FORM, INPUT, SELECT, OPTION, TEXTAREA, etc. as some people have used these effectively in their posts already.
    This should be a start, and there are already modules that can help out with the matching of tags and filtering.
      The 'offer your reply' page should have a preview button :)

      STRONG was created by a false analogy. EM is for emphasis, that makes sense, but what does STRONG mean? In reality, STRONG and EM represent different levels of emphasis.

      Nir Dagan said on the www-html list:

      A common myth is that <strong> is better than <b> since it gives the user (or browser) the option to control the style better. This is wrong since <b> and <strong> all have the same syntax properties in HTML and admit the same style rules.
      Jon Roland Eriksson on the www-html mailing list:
      But a double <EM> sounds very close to <STRONG> to me.
      Please consider the text from RFC1866...

      5.7.1.3. Emphasis: EM
      The <EM> element indicates an emphasized phrase, typically rendered as italics.

      5.7.1.6. Strong Emphasis: STRONG
      The <STRONG> element indicates strong emphasis, typically rendered in bold.
      ....please note that both headlines up there addresses the same thing, just at two different levels of strength.
      However, in the case of I vs. EM, EM still wins. I sets the enclosed text in italics no matter the context. But this means <i> level 1<i> level 2</i></i> is rendered completely in italics, making level 2 indistinguishable from level 1. EM does not have this problem: <em>level 1<em> level 2</em></em> renders level 1 in italics and level 2 in upright text.

      I'm sure you're thinking "who cares" by now... but it's important to note that STRONG is not better than B, while EM is better than I.