JavaScript is indeed a bad thing
by Corion (Patriarch) on Jun 01, 2000 at 16:11 UTC

    You seem to be testing the Perlmonks engine to its limit :) but you are right indeed: not stripping the <SCRIPT> tags from HTML is bad, as I don't see any valid reason why people would want to see JavaScript in the posts here.

    I think the following code should strip all <SCRIPT> tags :

    my $post; $post =~ s!<SCRIPT(?: [^>]*)?>.*?</SCRIPT(?: [^>]*)?>!!igs;
    My reason for the optional group in the closing tag is that only a person with bad intent would want to use JavaScript in nodes anyway, and such a person might try to trick the code into not stripping the script by adding some attributes to the closing part of the script tags.
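To show the closing-tag trick being handled, here is a small sketch. The helper name strip_script and the sample text are made up for illustration and are not part of the Perlmonks engine. It also uses \s instead of a literal space so that a tab or newline before the attributes is caught as well.

```perl
use strict;
use warnings;

# Hypothetical helper; the name is an assumption, not engine code.
sub strip_script {
    my ($post) = @_;
    # /i: tag names in any case, /s: let . span newlines,
    # /g: remove every script element, not only the first one.
    $post =~ s!<SCRIPT(?:\s[^>]*)?>.*?</SCRIPT(?:\s[^>]*)?>!!igs;
    return $post;
}

# Made-up sample with attributes on both the opening and the closing tag.
my $sample = qq{Hi <script language="JavaScript">alert(1)</script foo="bar"> there};
print strip_script($sample), "\n";
```

Note that an opening <SCRIPT> tag with no closing tag at all is left untouched by this pattern, so it is not a complete defence on its own.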

    On the other hand, I think <FONT> tags and other things (like color etc.) should be avoided too. Maybe it would be better to have a positive list of allowed tags instead of allowing everything and then banning some special tags...

    My list of "good" tags would be more or less the following :

    Section            Allowed tags
    Font manipulation  B, I, U, TT, CODE, H1, H2, H3, H4, H5, H6
    Layout             TABLE, THEAD, TBODY, TR, TH, TD, CENTER, P, DIV, UL, OL, LI
    Links              A
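A positive list like the one above could be applied roughly as follows. This is only a sketch: the helper name filter_tags is made up, and escaping the disallowed tags (rather than deleting them) is my own assumption about what the engine might do.

```perl
use strict;
use warnings;

# The tags from the list above, keyed in upper case.
my %allowed = map { $_ => 1 } qw(
    B I U TT CODE H1 H2 H3 H4 H5 H6
    TABLE THEAD TBODY TR TH TD CENTER P DIV UL OL LI
    A
);

# Hypothetical helper: any tag whose name is not on the list is
# escaped so that it shows up as text instead of being rendered.
sub filter_tags {
    my ($post) = @_;
    $post =~ s{<(/?)([A-Za-z][A-Za-z0-9]*)([^>]*)>}{
        $allowed{uc $2} ? "<$1$2$3>" : "&lt;$1$2$3&gt;"
    }ge;
    return $post;
}

print filter_tags('<b>bold</b> <font color="red">red</font>'), "\n";
```

Attributes on allowed tags pass through unchanged here, so something like an onclick attribute would still need separate handling.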

    Also, the engine could maybe even check for ill-formed HTML, that is, unclosed tags. I hate it if somebody posts with <PRE> and then does not close the tag so that all subsequent text is rendered as preformatted in Courier New. But that one requires much more analysis I think - or maybe not. An idea off the top of my head :

    1. Have a list of paired tags
    2. For each start tag of a pair, search for a closing tag, and remove both.
    3. If no closing tag was found, append the matching closing tag (regardless of scope) to the submitted text. This will mess up the layout, but the person should have submitted well-formed HTML in the first place.
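The three steps above could be sketched like this. The tag list is an illustrative subset and the helper name close_unclosed is made up; the pairs are removed from a scratch copy only, so the submitted text itself is left as it was apart from the appended closing tags.

```perl
use strict;
use warnings;

# Step 1: a list of paired tags (illustrative subset only).
my @paired = qw(PRE CODE B I TT UL OL TABLE);

# Hypothetical helper: returns the post with closing tags appended
# for every start tag that has no closing tag after it.
sub close_unclosed {
    my ($post) = @_;
    my $scratch = $post;    # working copy; matched pairs are removed from it
    my $append  = '';
    for my $tag (@paired) {
        # Step 2: remove each start tag together with the nearest closing tag.
        1 while $scratch =~ s{<$tag(?:\s[^>]*)?>(.*?)</$tag\s*>}{$1}is;
        # Step 3: every start tag still left over has no closing tag,
        # so one closing tag per leftover is appended, regardless of scope.
        my $open = () = $scratch =~ /<$tag(?:\s[^>]*)?>/gi;
        $append .= "</$tag>" x $open;
    }
    return $post . $append;
}

print close_unclosed('<pre>some listing that never ends'), "\n";
```

Stray closing tags without a start tag are ignored by this sketch, and the appended tags come out in list order rather than in proper nesting order.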

    This method is crude and maybe destroys more than it does good - maybe instead of fixing the HTML, the engine should simply return a warning like

    Your post contains what we consider bad HTML, please fix it.

    Update: vroom has a post about his position on HTML online now.

      Great idea about limiting the HTML

      You seem to have forgotten the <STRONG> and <EM> tags. Those of us who have been doing HTML for a while know that these are more correct than using <B> and <I>. <CENTER> should not be included as it has been deprecated.

      In addition, we should consider:

      • BLOCKQUOTE
      • PRE
      • DL, DT, and DD
      • HR (?? - touchy call)
      • IMG (subject to abuse - perhaps only higher level monks?)
      • SUB and SUP
      • TFOOT (since you already have THEAD)
      • VAR
      • Possibly FORM, INPUT, SELECT, OPTION, TEXTAREA, etc. as some people have used these effectively in their posts already.
      This should be a start, and there are already modules that can help out with the matching of tags and filtering.
        The 'offer your reply' page should have a preview button :)

        STRONG was created by a false analogy. EM is for emphasis, that makes sense, but what does STRONG mean? In reality, STRONG and EM represent different levels of emphasis.

        Nir Dagan said on the www-html list:

        A common myth is that <strong> is better than <b> since it gives the user (or browser) the option to control the style better. This is wrong since <b> and <strong> all have the same syntax properties in HTML and admit the same style rules.
        Jon Roland Eriksson on the www-html mailing list:
        But a double <EM> sounds very close to <STRONG> to me.
        Please consider the text from RFC1866...

        5.7.1.3. Emphasis: EM
        The <EM> element indicates an emphasized phrase, typically rendered as italics.

        5.7.1.6. Strong Emphasis: STRONG
        The <STRONG> element indicates strong emphasis, typically rendered in bold.
        ....please note that both headlines up there addresses the same thing, just at two different levels of strength.
        However, in the case of I vs. EM, EM still wins. I sets the enclosed text in italics no matter the context. But this means <i> level 1<i> level 2</i></i> is rendered completely in italics, making level 2 indistinguishable from level 1. EM does not have this problem: <em>level 1<em> level 2</em></em> renders level 1 in italics and level 2 in upright text.

        I'm sure you're thinking "who cares" by now... but it's important to note STRONG is not better than B while EM is better than I.

RE: JavaScript allowed in posts!
by BBQ (Curate) on Jun 02, 2000 at 09:09 UTC
    Well, we already have the tools to display it as code. I don't see why <SCRIPT> couldn't be converted into <CODE>! Script could be a synonym, and it would take care of the actual script tag altogether...
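A rough sketch of that conversion, assuming the engine's existing <CODE> handling then escapes whatever is inside. The helper name script_to_code is made up, and dropping the attributes from the script tag is my own choice.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite opening and closing SCRIPT tags as CODE
# tags, discarding any attributes they carried.
sub script_to_code {
    my ($post) = @_;
    $post =~ s!<(/?)SCRIPT(?:\s[^>]*)?>!<${1}CODE>!ig;
    return $post;
}

print script_to_code('<script language="JavaScript">alert(1)</script>'), "\n";
```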

    #!/home/bbq/bin/perl
    # Trust no1!