the.duck has asked for the wisdom of the Perl Monks concerning the following question:

I've been tasked with parsing some daily "xml" files and gathering data from them. I use "xml" as it is an abomination with a .xml extension. The issue is that in the file there are MANY Windows newlines intermixed with the valid Unix line feeds. This results in things like:

<FormattedReportObjects> <FormattedReportObject xsi:type="FormattedField" Type="xsd:long" FieldName="{Sum_ttx. E v e n t I D } " > <ObjectName>Field2</ObjectName> <FormattedValue>0</FormattedValue>
The newlines that LOOK correct are line feeds; wherever there is one character per line, each break is a <CR><LF>.

Does anyone have any ideas on how I could fix this? (The obvious "make your XML valid" has been tried and failed.) I've tried tr, sed, and perl one-liners, all to no avail. E.g.:

perl -ne ' s/\r\n?//g; print ' foo.xml
sed -e s/^M\n//g foo.xml
tr -d ^M\n foo.xml
I appreciate any help anyone can provide. Thanks.

Replies are listed 'Best First'.
Re: Scrubbing XML
by anonymized user 468275 (Curate) on Apr 18, 2011 at 16:03 UTC
    On some unix systems you could pass the file through the dos2unix facility, e.g.
    dos2unix < foo.xml > fooOK.xml
    If that is missing, or if it still doesn't work, I'd try a hardcoded (into binary) version of the regexp, e.g.:
    for my $hardcoded ( chr(13) . chr(10), chr(10) . chr(13) ) { s/$hardcoded//g; }

    One world, one people

      Well, the problem with dos2unix, or anything that just removes the carriage return, is that I'm left with extra line feeds. When I see a <CR>, I also need to remove the <LF> that follows it, without removing all the other <LF>s.

        I anticipated that, hence the hardcoded regexp idea, but I just remembered something else -- you might need to set $/ = undef() as well as the hardcoded regexp, to prevent the CR and LF being split across a line break.

        Update: and if using perl -ne, that would have to be done in a BEGIN{ } block

        One world, one people
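
A minimal sketch of the slurp-mode idea above, assuming the unwanted breaks really are a CR immediately followed by an LF. The helper name strip_crlf_pairs and the sample data are made up for illustration; an in-memory filehandle stands in for foo.xml so the snippet runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a whole filehandle and delete CR+LF pairs,
# leaving lone LFs untouched.
sub strip_crlf_pairs {
    my ($fh) = @_;
    local $/;                 # undef record separator: read everything at once
    my $text = <$fh>;
    $text =~ s/\r\n//g;       # remove only a CR immediately followed by an LF
    return $text;
}

# Made-up sample: "Event" split one character per line by CR+LF,
# with ordinary LFs between the tags.
my $sample = "<a>\n<b FieldName=\"E\r\nv\r\ne\r\nn\r\nt\">\n</a>\n";
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
print strip_crlf_pairs($fh);
close $fh;
```

On the command line, the -0777 switch puts perl into slurp mode without a BEGIN block, so something like perl -0777 -pe 's/\r\n//g' foo.xml > fooOK.xml should behave the same way under the same assumption.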

Re: Scrubbing XML
by cdarke (Prior) on Apr 18, 2011 at 16:00 UTC
    "\n" is usually ignored in regular expressions, unless the /s flag is set at the end (as you would the /g flag).
    However in this case I would have thought that s/\r//g would be sufficient.

    Update: although maybe s/\r\n?//s; is needed here (I'm having difficulty in reproducing the data here).

      \n itself isn't ignored; what you're thinking of is ., which doesn't match \n without the /s flag.
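
A small sketch of that distinction, using a made-up two-line string: a literal \n in a pattern matches a newline regardless of flags, while . only matches one when /s is given.

```perl
use strict;
use warnings;

my $text = "a\nb";

# A literal \n in a pattern matches a newline with or without /s.
my $literal = $text =~ /a\nb/ ? 1 : 0;
# Without /s, "." refuses to match the newline between a and b.
my $dot     = $text =~ /a.b/  ? 1 : 0;
# With /s, "." matches any character, newline included.
my $dot_s   = $text =~ /a.b/s ? 1 : 0;

print "literal=$literal dot=$dot dot_s=$dot_s\n";   # literal=1 dot=0 dot_s=1
```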

Re: Scrubbing XML
by halfcountplus (Hermit) on Apr 18, 2011 at 16:00 UTC
    If you can do without any newlines at all:
    s/\s+/ /g;
    "\s" includes \r and \n.
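
A quick sketch of what that substitution does to data shaped like the question's (the sample string is made up). Note that characters separated by <CR><LF> end up separated by a single space rather than joined back together, so this may not be enough if the split values need to be restored.

```perl
use strict;
use warnings;

# Made-up sample: "ID" split across a CR+LF, plus an ordinary LF between tags.
my $sample = "<a Name=\"I\r\nD\">\n<b/>";

# Copy the string, then collapse every run of whitespace to one space.
(my $collapsed = $sample) =~ s/\s+/ /g;
print "$collapsed\n";   # <a Name="I D"> <b/>
```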
Re: Scrubbing XML
by Anonymous Monk on Apr 18, 2011 at 18:47 UTC
    Surely you can slurp the file into memory, fix the line endings however you need to, and then process the resulting data normally?

      /me nods...

      Does the XML file format even care, at all, about newlines? My vague recollection is that it does not. Why can’t you just read the file, use an s///g... regex to stomp them all out, and process the resulting string? (For most computers these days, processing several megabytes “as a string” is no big deal anymore.)

        Does the XML file format even care, at all, about newlines?

        It depends what you mean.

        To XML, newlines and carriage returns do not have special meaning. Stripping them from the document will not affect the validity of the document.

        On the other hand, it will change the values of text nodes and attribute nodes. (Upd: Not completely true: See Re^7 ) That may or may not be desirable. The OP indicated he only wanted to remove <CR><LF> pairs and leave lone <LF> behind, which can be done using your technique.

        Update: Replaced bad example with better explanation.

        I don't think it does, and I've started to work with it that way. Makes it a tad hard to read (for verifying my program), but once I've coded it, what do I care? Thank you for your response!

        Jen, if your computer was a person I'd shoot it in the face.