in reply to Re: Scrubbing XML
in thread Scrubbing XML

/me nods...

Does the XML file format even care, at all, about newlines?   My vague recollection is that it does not.   Why can’t you just read the file, use an s///g... regex to stomp them all out, and process the resulting string?   (For most computers these days, processing several megabytes “as a string” is no big deal anymore.)
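For what it's worth, a minimal sketch of that idea follows, using only core Perl. The helper names `slurp` and `strip_newlines` are made up for illustration, and the sample string stands in for a real file. Note that this deletes line breaks inside text and attribute content too, which may or may not be what you want.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
# ':raw' keeps the I/O layer from translating CRLF on Windows.
sub slurp {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                 # undefine the record separator: read everything
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Hypothetical helper: delete every carriage return and line feed.
sub strip_newlines {
    my ($xml) = @_;
    $xml =~ s/[\r\n]//g;
    return $xml;
}

# In-memory sample so the sketch runs without a file on disk.
my $sample = "<doc>\r\n  <item id=\"1\">text</item>\r\n</doc>\r\n";
print strip_newlines($sample), "\n";
```

With a real file this would be called as `strip_newlines(slurp($path))`, where `$path` is whatever file you are scrubbing.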

Re^3: Scrubbing XML
by ikegami (Patriarch) on Apr 18, 2011 at 19:17 UTC

    Does the XML file format even care, at all, about newlines?

    It depends what you mean.

    To XML, newlines and carriage returns do not have special meaning. Stripping them from the document will not affect the validity of the document.

    On the other hand, it will change the values of text nodes and attribute nodes. (Upd: Not completely true; see Re^7.) That may or may not be desirable. The OP indicated he only wanted to remove <CR><LF> pairs and leave lone <LF> behind, which can be done using your technique.

    Update: Replaced bad example with better explanation.
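As a rough illustration of the CR/LF point above, the sketch below removes only <CR><LF> pairs and leaves a lone <LF> untouched. The function name `remove_crlf_pairs` is hypothetical, and the sketch assumes the data was read in raw mode (for example with the `:raw` layer or `binmode`), since otherwise Perl on Windows would already have translated the pairs before the regex sees them.

```perl
use strict;
use warnings;

# Hypothetical helper: delete CR+LF pairs only.
# The pattern matches "\r\n" as a unit, so a lone "\n" never matches.
sub remove_crlf_pairs {
    my ($text) = @_;
    $text =~ s/\r\n//g;
    return $text;
}

# Sample data: two CRLF pairs and one lone LF inside <b>.
my $in  = "<a>\r\n<b>x\ny</b>\r\n</a>";
my $out = remove_crlf_pairs($in);

# Make the surviving line feed visible when printing.
(my $visible = $out) =~ s/\n/<LF>/g;
print "$visible\n";
```

The lone <LF> inside the `b` element survives, so the value of that text node still contains a line break.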

      Sorry to come back to this so late but I have a follow-up question: does it really change the meaning of attributes? I was under the impression that newlines (and presumably all whitespace but " ") were insignificant in XML attribute values; but I'm not certain. libxml seems to agree.

        They are definitely significant.

        value="foobar"
        is not the same as
        value="foo bar"

        And XML parsers do not return the same for

        value="foo bar"
        and
        value="foo  bar"   (two spaces)

        You might be thinking of HTML's whitespace collapsing rules. Even then, that's for rendering, and attributes aren't rendered.

Re^3: Scrubbing XML
by the.duck (Novice) on Apr 18, 2011 at 19:48 UTC

    I don't think it does, and I've started to work with it that way. Makes it a tad hard to read (for verifying my program), but once I've coded it, what do I care? Thank you for your response!

    Jen, if your computer was a person I'd shoot it in the face.
      You can always "pretty print" it. XML::Twig installs xml_pp that does just that. Note that it adds spaces which you might have to trim out later if you pass the prettied version to your parser.