in reply to Re^2: Scrubbing XML
in thread Scrubbing XML

Does the XML file format even care, at all, about newlines?

It depends what you mean.

To XML, newlines and carriage returns do not have special meaning. Stripping them from the document will not affect the validity of the document.

On the other hand, it will change the values of text nodes and attribute nodes. (Update: Not completely true; see Re^7.) That may or may not be desirable. The OP indicated he only wanted to remove <CR><LF> pairs and leave lone <LF> characters behind, which can be done using your technique.
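
For illustration, here is a minimal sketch of removing only the <CR><LF> pairs, assuming the whole document is already held in a Perl string. The helper name strip_crlf is made up for this example, and this is not necessarily the technique from the parent post.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every CR LF pair and leave lone LFs
# (and lone CRs) untouched.
sub strip_crlf {
    my ($xml) = @_;
    $xml =~ s/\r\n//g;
    return $xml;
}

my $in  = "<a>one\r\ntwo\nthree</a>";
my $out = strip_crlf($in);

# The CR LF between "one" and "two" is gone; the lone LF survives.
print $out eq "<a>onetwo\nthree</a>" ? "ok\n" : "not ok\n";
```

If the text comes from a file, opening it with the :raw layer keeps Perl from translating line endings before the substitution sees them on platforms where that applies.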

Update: Replaced bad example with better explanation.

Re^4: Scrubbing XML
by Your Mother (Archbishop) on Jun 02, 2011 at 13:15 UTC

    Sorry to come back to this so late but I have a follow-up question: does it really change the meaning of attributes? I was under the impression that newlines (and presumably all whitespace but " ") were insignificant in XML attribute values; but I'm not certain. libxml seems to agree.

      They are definitely significant.

      value="foobar"
      is not the same as
      value="foo bar"

      And XML parsers do not return the same for

      value="foo bar"
      and
      value="foo  bar"

      You might be thinking of HTML's whitespace collapsing rules. Even then, that's for rendering, and attributes aren't rendered.

        Well, that's not the meat of what I asked. I asked if newlines were significant. libxml does collapse \n and \r (but apparently not others like \t and \f) to plain spaces in attributes. I'm mostly curious whether this is part of the standard or a quirk in libxml's handling.

        perl -MXML::LibXML -le'print XML::LibXML->new->parse_string(qq{<root attr="a\nb"/>})->serialize'
        <?xml version="1.0"?>
        <root attr="a b"/>
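
For reference, the XML 1.0 specification describes attribute-value normalization (section 3.3.3): literal tab, line feed, and carriage return characters inside an attribute value are replaced by a space, while the same characters written as character references are kept. The sketch below compares the two cases with XML::LibXML. The helper name attr_value and the sample documents are made up for this example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: parse a one-element document and return the value
# of its "attr" attribute as the parser reports it.
sub attr_value {
    my ($xml) = @_;
    my $doc = XML::LibXML->new->parse_string($xml);
    return $doc->documentElement->getAttribute('attr');
}

# A raw newline typed directly into the attribute value.
my $literal = attr_value(qq{<root attr="a\nb"/>});

# The same newline written as a character reference.
my $charref = attr_value(q{<root attr="a&#10;b"/>});

printf "literal newline: %s\n", $literal eq "a b"  ? "became a space"  : "kept";
printf "&#10; reference: %s\n", $charref eq "a\nb" ? "kept as newline" : "changed";
```

So a newline that has to survive parsing inside an attribute would need to be written as &#10; in the source.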