Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I need to generate a document that can pass XHTML strict 1.0 from raw text.

Since the text may contain &, > and <, which the validator will complain about, I need to escape them. But I don't want to use a very simple regexp that turns every & into &amp;, as that would break anything already written as &xxx; or &#XXXXX;.

Should I use a regular expression or a module to help me do this?

Replies are listed 'Best First'.
Re: HTML escape char?
by pc88mxer (Vicar) on Mar 25, 2008 at 05:30 UTC
    Use the encode_entities function from the HTML::Entities module:
    use HTML::Entities;
    my $text = ...;
    my $encoded_text = encode_entities($text);
    If you want to force the use of numeric (hexadecimal) character references, you can use the encode_entities_numeric function.

    The other issue is that you shouldn't mix plain text and entity-encoded text in the same string. For instance, if you have a string with the contents &amp;, you should first decode it so that it is represented by the single-character string &. Then, after you are done manipulating it, re-encode it with the encode_entities function. From the way you describe your problem, it sounds like you could be running into this issue. If you describe more about how your XML is being generated, we can tell you whether this is indeed the case.
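    The decode-then-encode round trip described above might look like the following. This is only a sketch: the sample string is made up, and restricting the unsafe set to '<>&"' is an assumption about what the validator needs (by default encode_entities also encodes quotes, control characters and high-bit characters).

```perl
use strict;
use warnings;
use HTML::Entities;

# Hypothetical input mixing an existing entity with bare special characters.
my $raw = 'Fish &amp; chips cost < 5 & > 3';

# Step 1: decode, so the whole string is plain text.
# In void context decode_entities modifies its argument in place.
my $plain = $raw;
decode_entities($plain);

# Step 2: encode once, limiting the unsafe set to the markup characters.
my $safe = encode_entities($plain, '<>&"');

print "$plain\n";   # Fish & chips cost < 5 & > 3
print "$safe\n";    # Fish &amp; chips cost &lt; 5 &amp; &gt; 3
```

    Because the existing &amp; is decoded before encoding, it is not turned into &amp;amp;.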

Re: HTML escape char?
by Your Mother (Archbishop) on Mar 25, 2008 at 06:46 UTC

    Always a module for this kind of thing. It might feel like more work to set up but you will save yourself hours of learning the hard way otherwise.

    You'll also need well-formed block tags around your text. Strict XHTML doesn't allow bare text or inline content directly in the body. There are a lot of ways to do this; the one I like a lot is using XML::LibXML as an XHTML filter/writer.

    Semi-tested minimalist version. You'll have to manipulate/add headers and the <html/> yourself-

    use strict;
    use warnings;
    use XML::LibXML;

    local $/ = "\n\n";

    my $doc = XML::LibXML::Document->createDocument();
    my $root = $doc->createElement("body");
    $doc->setDocumentElement( $root );

    while ( my $para = <DATA> ) {
        chomp $para;
        my $p = $doc->createElement("p");
        my $txt = $doc->createTextNode( $para );
        $p->appendChild($txt);
        $root->appendChild($p);
    }

    print $doc->serialize(1);

    __DATA__
    Some arbitrary text > some fixed text & that's okay.

    XHTML is fun with L<XML::LibXML>.
    Yields-
    <?xml version="1.0"?>
    <body>
      <p>Some arbitrary text &gt; some fixed text &amp; that's okay.</p>
      <p>XHTML is fun with L&lt;XML::LibXML&gt;.
    </p>
    </body>
    Update: LibXML can be a bit daunting. Here is a quickie you can use to get just the block-level tags (the <p/>s) out of the document:

    print $_->serialize(1), $/ for $doc->getDocumentElement->childNodes;
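
    For the headers and the <html/> wrapper that the minimalist version leaves out, one possible sketch follows. The title text and the $text sample are placeholders, and the choice of createInternalSubset for the DOCTYPE and createElementNS for the XHTML namespace is an assumption about how one might do it, not the only way. The result has not been run through the validator here.

```perl
use strict;
use warnings;
use XML::LibXML;

# Placeholder input: paragraphs separated by blank lines.
my $text = "Some arbitrary text > some fixed text & that's okay.\n\n"
         . "XHTML is fun with L<XML::LibXML>.";

my $ns  = "http://www.w3.org/1999/xhtml";
my $doc = XML::LibXML::Document->createDocument( "1.0", "UTF-8" );

# Root element in the XHTML namespace.
my $html = $doc->createElementNS( $ns, "html" );
$doc->setDocumentElement($html);

# DOCTYPE for XHTML 1.0 Strict; the DTD node is attached to the document.
$doc->createInternalSubset(
    "html",
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
);

# <head> with the <title> that Strict requires (placeholder title).
my $head  = $doc->createElementNS( $ns, "head" );
my $title = $doc->createElementNS( $ns, "title" );
$title->appendTextNode("Untitled");
$head->appendChild($title);
$html->appendChild($head);

# <body> with one <p> per paragraph; text nodes are escaped on output.
my $body = $doc->createElementNS( $ns, "body" );
$html->appendChild($body);
for my $para ( split /\n{2,}/, $text ) {
    my $p = $doc->createElementNS( $ns, "p" );
    $p->appendTextNode($para);
    $body->appendChild($p);
}

my $xml = $doc->toString(1);
print $xml;
```

    Splitting on blank lines with split replaces the $/ trick from the version above, so the paragraphs can come from any string rather than from the DATA handle.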