in reply to Re^5: Scrubbing XML
in thread Scrubbing XML

Well, that's not the meat of what I asked. I asked whether newlines were significant. libxml does collapse \n and \r (but apparently not others like \t and \f) to plain spaces in attributes. I'm mostly curious whether this is part of the standard or a quirk in libxml's handling.

perl -MXML::LibXML -le'print XML::LibXML->new->parse_string(qq{<root attr="a\nb"/>})->serialize'
<?xml version="1.0"?>
<root attr="a b"/>

Re^7: Scrubbing XML
by ikegami (Patriarch) on Jun 02, 2011 at 17:00 UTC

    Ah! News to me! If libxml does it, I'm sure it's in the spec. And here it is: 3.3.3 Attribute-Value Normalization.

    1. \r\n (and any lone \r) is converted to \n. (Done for the entire document.)
    2. Entity references (e.g. &eacute;) are interpolated.
    3. \r, \n and \t are converted to spaces.
    4. Character references (e.g. &#xE9;) are interpolated.
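
The steps above can be sketched in plain Perl. This is only a simplified illustration, not what libxml does internally: normalize_attr_value is a made-up name, only the five predefined entities are known to it, and it covers CDATA-type attributes only (other attribute types get additional space trimming). The value is walked in a single pass, so a character produced by a character reference is never turned into a space afterwards.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): simplified attribute-value
# normalization for a CDATA attribute.
# Assumption: only the five predefined entities exist; a real parser also
# expands entities declared in the DTD.
my %PREDEFINED = (lt => '<', gt => '>', amp => '&', quot => '"', apos => "'");

sub normalize_attr_value {
    my ($raw) = @_;

    # Line-end normalization: \r\n and lone \r become \n.
    $raw =~ s/\r\n?/\n/g;

    my $out = '';

    # Consume one token at a time: a hex character reference, a decimal
    # character reference, an entity reference, or a single literal character.
    while ($raw =~ /\G(?:&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_][\w.-]*);|(.))/gcs) {
        if (defined $1) {
            $out .= chr hex $1;          # &#xE9; style reference, kept as-is
        }
        elsif (defined $2) {
            $out .= chr $2;              # &#233; style reference, kept as-is
        }
        elsif (defined $3) {
            die "unknown entity &$3;\n" unless exists $PREDEFINED{$3};
            $out .= $PREDEFINED{$3};
        }
        else {
            my $c = $4;
            # Literal tab, newline and carriage return become a space.
            $out .= ($c =~ /[\t\n\r]/) ? ' ' : $c;
        }
    }
    return $out;
}
```

With this sketch, normalize_attr_value("a\nb") gives "a b", while normalize_attr_value("a&#10;b") keeps the newline. That difference is why a literal newline in an attribute is lost but one written as a character reference is not.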

      Nice. Looks like yet another example of the insanity of the XML specification authors. It's good to know XML parsers MUST corrupt the value of attributes. Really guys ... how many of you knew you have to escape your tabs and newlines when including the data in XML attributes? And which XML generators do that?

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.
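
On the generating side, Jenda's point about escaping tabs and newlines can be sketched as follows. This is a minimal illustration for hand-rolled output, assuming the attribute is written in double quotes; escape_attr_value is a made-up name, not a function from XML::LibXML or any other module.

```perl
use strict;
use warnings;

# Hypothetical helper: escape a string for use inside a double-quoted
# XML attribute so that tabs and line breaks survive a round trip.
sub escape_attr_value {
    my ($value) = @_;

    $value =~ s/&/&amp;/g;     # must come first, before other references are added
    $value =~ s/</&lt;/g;
    $value =~ s/"/&quot;/g;

    # Write whitespace other than a plain space as character references,
    # which the parser does not normalize to spaces.
    $value =~ s/([\t\n\r])/sprintf '&#%d;', ord $1/ge;

    return $value;
}

my $attr = escape_attr_value("line one\n\tline \"two\"");
print qq{<root attr="$attr"/>\n};
# prints: <root attr="line one&#10;&#9;line &quot;two&quot;"/>
```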