in reply to Re^3: Scrubbing XML
in thread Scrubbing XML

Sorry to come back to this so late but I have a follow-up question: does it really change the meaning of attributes? I was under the impression that newlines (and presumably all whitespace but " ") were insignificant in XML attribute values; but I'm not certain. libxml seems to agree.

Re^5: Scrubbing XML
by ikegami (Patriarch) on Jun 02, 2011 at 15:59 UTC

    They are definitely significant.

    value="foobar"
    is not the same as
    value="foo bar"

    And XML parsers do not return the same for

    value="foo bar"
    and
    value="foo  bar"

    You might be thinking of HTML's whitespace collapsing rules. Even then, that's for rendering, and attributes aren't rendered.

      Well, that's not the meat of what I asked. I asked if newlines were significant. libxml does collapse \n and \r (but apparently not others like \t and \f) to plain spaces in attributes. I'm mostly curious whether this is part of the standard or a quirk in libxml's handling.

      perl -MXML::LibXML -le'print XML::LibXML->new->parse_string(qq{<root attr="a\nb"/>})->serialize'
      <?xml version="1.0"?>
      <root attr="a b"/>

        Ah! News to me! If libxml does it, I'm sure it's in the spec. And here it is: 3.3.3 Attribute-Value Normalization.

        1. \r\n is converted to \n. (Done for the entire document.)
        2. Entity references (e.g. &eacute;) are interpolated.
        3. \r, \n and \t are converted to spaces.
        4. Character references (e.g. &#xE9;) are interpolated.
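
The four steps above can be sketched in plain Perl. This is only a rough illustration, not how libxml implements it: normalize_attr_value is a hypothetical helper, it knows only the five predefined entities (a real parser also expands entities declared in the DTD), it does not check well-formedness, and it skips the extra trimming the spec applies to non-CDATA attribute types.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module): a simplified take on
# XML 1.0 section 3.3.3 for CDATA attributes.
# Assumption: only the five predefined entities are known.
my %PREDEFINED = (lt => '<', gt => '>', amp => '&', quot => '"', apos => "'");

sub normalize_attr_value {
    my ($raw) = @_;

    # Step 1: line-end normalization, which a parser applies to the whole document.
    $raw =~ s/\r\n?/\n/g;

    my $out = '';
    # Walk the raw value one token at a time, left to right.
    while ($raw =~ /\G(?:&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_:][\w:.-]*);|([\t\n\r])|(.))/gs) {
        if (defined $1) {
            $out .= chr hex $1;        # hex character reference: appended as-is
        }
        elsif (defined $2) {
            $out .= chr $2;            # decimal character reference: appended as-is
        }
        elsif (defined $3) {
            exists $PREDEFINED{$3}
                or die "Unknown entity &$3; (this sketch has no DTD support)\n";
            $out .= $PREDEFINED{$3};   # entity reference: interpolated
        }
        elsif (defined $4) {
            $out .= ' ';               # literal tab, newline or CR becomes a space
        }
        else {
            $out .= $5;                # anything else is copied through unchanged
        }
    }
    return $out;
}
```

Under these assumptions a literal newline in the attribute comes back as a space, while a newline written as the character reference &#xA; survives as a real newline, which is why character references sit in a separate step from the whitespace conversion.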