throop has asked for the wisdom of the Perl Monks concerning the following question:

Brethren

In CGI forms, users type in text. Sometimes they type in pseudo-HTML, like <I>, <BR>, <P> and all the other Perl Monks Approved HTML tags. Applications then display the typed-in text, honoring the tags. But that text has to be cleaned up first. To start with, unclosed tags have to be closed. And only certain tags should be allowed through; I don't want <script> tags injecting JavaScript.

Is there a module for 'normalizing' the pseudo-HTML that a user types into a CGI? I didn't have much luck at CPAN, but I might not have been using the right search terms.

PerlMonks itself is a patched version of Everything2. I may muck through it and see what I can find. But I was hoping someone could point me to a standalone module.

thanks
throop

(Thanks to ikegami, snowhare and others in the CB who helped me articulate this.)

Re: CGI Typein box for pseudo-HTML - is there a module?
by erroneousBollock (Curate) on Sep 27, 2007 at 15:47 UTC
    On the client side, there are various JavaScript DOM APIs for dealing with HTML fragments, or you can let the browser do it implicitly (it has a very fast HTML parser :-).

    A trick I've used in the past to "balance" unbalanced HTML source is to:

    1. create a placeholder div with document.createElement (don't bother to attach it to the document),
    2. assign the "unbalanced" text to the innerHTML property of the new DIV, and
    3. retrieve the innerHTML back from the div.

    If that doesn't work for you, you might try the Range.cloneContents method (the standard says this does balance "unbalanced" tags for you).

    Either of those can be done in (for example) a submit handler.
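
    The three steps above can be sketched roughly as follows. This is only an illustration, not David's actual code: the function name balanceFragment and the element ids post-form and post-text are made up, and the optional doc parameter is an assumption added so the function can be handed any DOM Document. It only balances tags; it does not remove disallowed ones.

```javascript
// Sketch of the detached-div trick. balanceFragment is a hypothetical name.
// `doc` is optional; in a browser it falls back to the global `document`.
function balanceFragment(html, doc) {
  const d = doc || document;
  // 1. Create a placeholder div; it is never attached to the page.
  const holder = d.createElement('div');
  // 2. Assigning to innerHTML makes the browser parse the fragment.
  holder.innerHTML = html;
  // 3. Reading innerHTML back serializes the parsed tree, so end tags
  //    the user left out are supplied by the browser.
  return holder.innerHTML;
}

// Browser-only usage: rewrite the textarea just before the form is sent.
// The ids 'post-form' and 'post-text' are invented for this example.
if (typeof document !== 'undefined') {
  const form = document.getElementById('post-form');
  if (form) {
    form.addEventListener('submit', () => {
      const box = document.getElementById('post-text');
      box.value = balanceFragment(box.value);
    });
  }
}
```

    Anything done in the browser can be bypassed, so this is a convenience for the user rather than a safeguard; the check for allowed tags still belongs on the server. Also be aware that markup assigned to innerHTML can trigger resource loads and handlers such as an <img> onerror even while the div is detached.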

    On the Perl side you can use something like HTML::PullParser, which gives you a lot of control over what you wish to ignore.

    -David

Re: CGI Typein box for pseudo-HTML - is there a module?
by zby (Vicar) on Sep 27, 2007 at 15:38 UTC
Re: CGI Typein box for pseudo-HTML - is there a module? (PM code)
by tye (Sage) on Sep 27, 2007 at 16:22 UTC

    I posted the HTML filtering code at Re: Proper nesting of HTML to be enforced (the code). The way "preview" is implemented is all tangled up with parts of the PerlMonks templating, making it quite a mess, whereas implementing it for most sites would be pretty trivial.

    And you should probably redirect after POST so that duplicates are less of a problem.

    - tye

Re: CGI Typein box for pseudo-HTML - is there a module?
by Cody Pendant (Prior) on Sep 28, 2007 at 02:00 UTC
    Off-topic somewhat, but I'm curious: why are you calling it "pseudo" HTML?


    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...
      1. It lacks <HTML> and <BODY> tags.
      2. It doesn't honor many tags and attributes that are part of the HTML standard (e.g., filtering out <IMG>).
      3. Oftentimes, it infers markup without tags (e.g., putting in a <P> where the user enters successive CRs).
      4. At many sites, the HTML tags are supplemented with some other markup language (e.g., Wikipedia).
      throop