CGI Typein box for pseudo-HTML - is there a module?

throop has asked for the wisdom of the Perl Monks concerning the following question:

Brethren

In CGI forms, users type in text. Sometimes they type in pseudo-HTML, like <I>, <BR>, <P> and the other Perl Monks Approved HTML tags. Applications then display the typed-in text, honoring the tags. But that text has to be cleaned up first. To start with, unclosed tags have to be closed. And only certain tags should be allowed through: I don't want <script> tags injecting JavaScript.

Is there a module for 'normalizing' the pseudo-HTML that a user types into a CGI? I didn't have much luck at CPAN, but I might not have been using the right search terms.

PerlMonks itself is a patched version of Everything2. I may muck through it and see what I can find. But I was hoping someone could point me to a standalone module.

thanks
throop

(Thanks to ikegami, snowhare and others in the CB who helped me articulate this.)

Re: CGI Typein box for pseudo-HTML - is there a module? by erroneousBollock (Curate) on Sep 27, 2007 at 15:47 UTC
On the client side, there are various JavaScript DOM APIs for dealing with HTML fragments, or you can let the browser do it implicitly (it has a very fast HTML parser :-).

A trick I've used in the past to "balance" unbalanced HTML source is to: create a placeholder div with document.createElement (don't bother to attach it to the document), assign the "unbalanced" text to the innerHTML property of the new div, and retrieve the innerHTML back from the div.

If that doesn't work for you, you might try the Range.cloneContents method (the standard says this does balance "unbalanced" tags for you). Either of those can be done in (for example) a submit handler.

On the Perl side you can use something like HTML::PullParser, which gives you a lot of control over what you wish to ignore.

-David
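
The server-side idea (parse the typed-in text, keep only the tags you want, close whatever was left open) can be sketched as follows. This is a minimal illustration in Python, using only the standard library's html.parser so that it runs without installing anything; it is not HTML::PullParser or any other module named in this thread. The whitelist ALLOWED_TAGS, the class PseudoHtmlCleaner and the function clean are hypothetical names chosen for the example. Assumptions: every attribute is dropped, disallowed tags are removed while their text is kept in escaped form, and tags still open at the end are closed in reverse order. Filtering of this kind would have to run on the server, since anything done in the browser can be bypassed.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist for illustration; a real site would pick its own.
ALLOWED_TAGS = {"i", "b", "em", "strong", "p", "br", "ul", "ol", "li", "code", "pre"}
VOID_TAGS = {"br"}  # tags that never take a closing tag


class PseudoHtmlCleaner(HTMLParser):
    """Keep whitelisted tags (without attributes) and escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # cleaned output fragments
        self.open_tags = []  # stack of whitelisted tags currently open

    def handle_starttag(self, tag, attrs):
        # tag arrives lowercased; attrs are ignored on purpose (no onclick etc.)
        if tag not in ALLOWED_TAGS:
            return
        self.out.append(f"<{tag}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Ignore stray or disallowed closing tags.
        if tag not in self.open_tags:
            return
        # Close any inner tags the user forgot, then the requested one.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Text is always escaped, so leftover "<" or "&" cannot form markup.
        self.out.append(escape(data, quote=False))


def clean(text):
    """Return text with only whitelisted, properly closed tags."""
    parser = PseudoHtmlCleaner()
    parser.feed(text)
    parser.close()
    while parser.open_tags:  # close whatever is still open, innermost first
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

For example, clean("<b>bold <i>both") closes both tags, and clean('<I onclick="evil()">hi</I>') keeps the tag but loses the attribute. Comments and doctype declarations are dropped because the parser's default handlers for them do nothing.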
Re: CGI Typein box for pseudo-HTML - is there a module? by zby (Vicar) on Sep 27, 2007 at 15:38 UTC
You might look at HTML::Scrubber (an alternative approach would be to use HTML::BBCode or Text::Markdown or one of the multitude of others).
Re: CGI Typein box for pseudo-HTML - is there a module? (PM code) by tye (Sage) on Sep 27, 2007 at 16:22 UTC
I posted the HTML filtering code in Re: Proper nesting of HTML to be enforced (the code). The way "preview" is implemented is all tangled in with parts of the PerlMonks templating, which makes it quite a mess, while implementing it for most sites would be pretty trivial. And you should probably redirect after POST so that duplicates are less of a problem.

- tye
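
The redirect-after-POST suggestion can be illustrated with a small sketch. It is written in Python against the standard WSGI calling convention and is not PerlMonks code; the names app, fake_request and POSTS, the form field "text" and the path /thread are all hypothetical. The point is only that the POST handler stores the submission and answers with a 303 redirect, so a browser reload repeats a harmless GET instead of resubmitting the form.

```python
import io
from urllib.parse import parse_qs

POSTS = []  # hypothetical in-memory store standing in for a database


def app(environ, start_response):
    """Tiny WSGI app: POST stores the text, then redirects to a GET page."""
    if environ["REQUEST_METHOD"] == "POST":
        size = int(environ.get("CONTENT_LENGTH") or 0)
        form = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
        POSTS.append(form.get("text", [""])[0])
        # 303 tells the browser to fetch /thread with GET, so reloading
        # that page does not resubmit the form.
        start_response("303 See Other", [("Location", "/thread")])
        return [b""]
    body = "\n".join(POSTS).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [body]


def fake_request(method, payload=b""):
    """Call app() directly, without a server; returns (status, headers, body)."""
    seen = {}

    def start_response(status, headers):
        seen["status"], seen["headers"] = status, dict(headers)

    environ = {
        "REQUEST_METHOD": method,
        "CONTENT_LENGTH": str(len(payload)),
        "wsgi.input": io.BytesIO(payload),
    }
    body = b"".join(app(environ, start_response))
    return seen["status"], seen["headers"], body
```

The stored text is served as text/plain here to keep the sketch short; a real page that honors tags would have to filter the text before displaying it as HTML.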
Re: CGI Typein box for pseudo-HTML - is there a module? by Cody Pendant (Prior) on Sep 28, 2007 at 02:00 UTC
Off-topic somewhat, but I'm curious: why are you calling it "pseudo" HTML?

Nobody says perl looks like line-noise any more; kids today don't know what line-noise IS ...
Re^2: Why I call it 'pseudo-HTML' by throop (Chaplain) on Sep 28, 2007 at 03:07 UTC
It lacks `<HTML>` and `<BODY>` tags. It doesn't honor many tags and attributes that are part of the HTML standard (e.g., it filters out `<IMG>`). Often it infers markup without tags (e.g., putting in a `<P>` where the user enters successive CRs). At many sites, the HTML tags are supplemented with some other markup language (e.g., Wikipedia).

throop
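
The "infers markup without tags" point can be made concrete with a small sketch of paragraph inference. It is in Python with the standard library only, and the function name infer_paragraphs is hypothetical. Assumptions: a run of two or more line breaks (possibly with whitespace between them) separates paragraphs, single line breaks are left alone, and all typed-in markup is escaped, so this shows the inference step only and not tag filtering.

```python
import re
from html import escape


def infer_paragraphs(text):
    """Wrap blocks separated by blank lines in <p>...</p>, escaping the text."""
    # Normalize CRLF and bare CR line endings to LF first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A paragraph break is a newline, optional whitespace, and another newline.
    blocks = re.split(r"\n\s*\n", text.strip())
    return "\n".join(
        f"<p>{escape(block, quote=False)}</p>" for block in blocks if block.strip()
    )
```

So the input "one", a blank line, then "two" yields two `<p>` elements, while a single line break inside a block stays inside its paragraph.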