in reply to Re^4: Scrubbing XML
in thread Scrubbing XML

They are definitely significant.

value="foobar"
is not the same as
value="foo bar"

And XML parsers do not return the same value for

value="foo bar"
and
value="foo  bar"

You might be thinking of HTML's whitespace collapsing rules. Even then, that's for rendering, and attributes aren't rendered.
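A quick way to check the claim above is to parse the three values and compare what comes back. This is only a sketch using Python's standard-library parser (expat via `xml.etree.ElementTree`), not the libxml binding discussed in this thread; the helper name `attr_value` is made up for illustration, and no DTD is involved, so the attribute is treated as CDATA.

```python
import xml.etree.ElementTree as ET


def attr_value(doc: str) -> str:
    """Parse a one-element document and return its 'value' attribute."""
    return ET.fromstring(doc).get("value")


no_space = attr_value('<root value="foobar"/>')
one_space = attr_value('<root value="foo bar"/>')
two_spaces = attr_value('<root value="foo  bar"/>')

# Each value comes back exactly as written; runs of spaces are not collapsed.
print(repr(no_space), repr(one_space), repr(two_spaces))
```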

Re^6: Scrubbing XML
by Your Mother (Archbishop) on Jun 02, 2011 at 16:12 UTC

    Well, that's not the meat of what I asked. I asked if newlines were significant. libxml does collapse \n and \r (but apparently not others like \t and \f) to plain spaces in attributes. I'm mostly curious whether this is part of the standard or a quirk in libxml's handling.

    perl -MXML::LibXML -le'print XML::LibXML->new->parse_string(qq{<root attr="a\nb"/>})->serialize'
    <?xml version="1.0"?>
    <root attr="a b"/>

      Ah! News to me! If libxml does it, I'm sure it's in the spec. And here it is: 3.3.3 Attribute-Value Normalization.

      1. \r\n is converted to \n. (Done for the entire document.)
      2. Entity references (e.g. &eacute;) are interpolated.
      3. Literal \r, \n and \t are converted to spaces.
      4. Character references (e.g. &#xE9;) are interpolated, so &#10; still yields a real newline.
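The practical effect of those steps can be seen by comparing literal whitespace with the same characters written as character references. The sketch below assumes Python's standard-library parser (expat) rather than libxml, and `parsed_attr` is a hypothetical helper, but the normalization it shows is the one the spec section describes for CDATA attributes.

```python
import xml.etree.ElementTree as ET


def parsed_attr(raw: str) -> str:
    """Wrap raw attribute text in a tiny document and return the parsed value."""
    return ET.fromstring(f'<root attr="{raw}"/>').get("attr")


# Literal newline and tab inside the attribute: each becomes a space.
literal = parsed_attr("a\nb\tc")

# The same characters written as character references survive untouched.
referenced = parsed_attr("a&#10;b&#9;c")

# \r\n is first folded to \n, which then becomes a single space.
crlf = parsed_attr("a\r\nb")

print(repr(literal), repr(referenced), repr(crlf))
```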

        Nice. Looks like yet another example of the insanity of the XML specification authors. It's good to know XML parsers MUST corrupt the value of attributes. Really guys ... how many of you knew you have to escape your tabs and newlines when including the data in XML attributes? And which XML generators do that?

        Jenda
        Enoch was right!
        Enjoy the last years of Rome.
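On the question of escaping tabs and newlines when generating attributes: one way to keep them intact is to write them as character references. The following is a minimal sketch, not a complete XML writer; `escape_attr` and `_ATTR_ESCAPES` are hypothetical names, and the round trip uses Python's standard-library parser. As far as I know, `xml.etree.ElementTree`'s own serializer does write a newline in an attribute as `&#10;`, which the last lines check.

```python
import xml.etree.ElementTree as ET

# Characters that must not appear literally if the attribute is to round-trip.
_ATTR_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    '"': "&quot;",
    "\t": "&#9;",
    "\n": "&#10;",
    "\r": "&#13;",
}


def escape_attr(value: str) -> str:
    """Escape a string for use inside a double-quoted XML attribute."""
    return "".join(_ATTR_ESCAPES.get(ch, ch) for ch in value)


original = 'line one\n\tline "two" & more\r'
doc = f'<root attr="{escape_attr(original)}"/>'
round_tripped = ET.fromstring(doc).get("attr")

# Without escaping, the newline and the tab each come back as a space.
naive = ET.fromstring('<root attr="line one\n\tline two"/>').get("attr")

# ElementTree's serializer escapes the newline itself.
generated = ET.tostring(ET.Element("root", {"attr": "a\nb"}), encoding="unicode")
reparsed = ET.fromstring(generated).get("attr")
```

Whether a given generator does this is worth testing with a round trip like the one above rather than assuming.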