Scrubbing XML

the.duck has asked for the wisdom of the Perl Monks concerning the following question:

I've been tasked with parsing some daily "xml" files and gathering data from them. I use "xml" as it is an abomination with a .xml extension. The issue is that in the file there are MANY Windows newlines intermixed with the valid Unix line feeds. This results in things like:

<FormattedReportObjects>
<FormattedReportObject 
xsi:type="FormattedField" Type="xsd:long" FieldName="{Sum_ttx.
E
v
e
n
t
I
D
}
"
>
<ObjectName>Field2</ObjectName>
<FormattedValue>0</FormattedValue>

Where the newlines that LOOK correct are line feeds and where there is a line per character is a <CR><LF>.

Anyone have any ideas on how I could fix this? (The obvious "make your XML valid" has been tried and failed) I've tried tr, sed, and perl one liners, all to no avail. E.G.

perl -ne ' s/\r\n?//g; print ' foo.xml 
sed -e s/^M\n//g foo.xml
tr -d ^M\n foo.xml

I appreciate any help anyone can provide. Thanks.

Replies are listed 'Best First'.
Re: Scrubbing XML by anonymized user 468275 (Curate) on Apr 18, 2011 at 16:03 UTC
On some Unix systems you could pass the file through the dos2unix utility, e.g. `dos2unix < foo.xml > fooOK.xml` If that is missing, or if it still doesn't work, I'd try a hardcoded (into binary) version of the regexp, e.g.: `for my $hardcoded ( chr(13) . chr(10), chr(10) . chr(13) ) { s/$hardcoded//g; }` One world, one people
Re^2: Scrubbing XML by the.duck (Novice) on Apr 18, 2011 at 16:12 UTC
Well, the problem with dos2unix or anything that just removes the carriage return is that I'm left with extra line feeds. When I see a <CR>, I also need to remove the <LF> that follows it, without removing all the other <LF>s.
Re^3: Scrubbing XML by anonymized user 468275 (Curate) on Apr 18, 2011 at 16:24 UTC
I anticipated that, hence the hardcoded regexp idea, but I just remembered something else: you might need to set $/ = undef() as well as using the hardcoded regexp, to prevent the CR and LF being split across a line break. Update: if using perl -ne, that would have to be done in a BEGIN { } block. One world, one people
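A minimal sketch of the slurp-then-substitute idea from this reply, assuming the stray breaks really are CR followed by LF. The helper names `strip_crlf_pairs` and `slurp_raw` and the sample string are hypothetical, made up for illustration, and only the CR-then-LF order is handled here.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every CR+LF pair, leave lone LFs alone.
sub strip_crlf_pairs {
    my ($text) = @_;
    $text =~ s/\x0D\x0A//g;    # \x0D is CR (chr 13), \x0A is LF (chr 10)
    return $text;
}

# Hypothetical helper: read a whole file as raw bytes in one go,
# so no CR+LF pair can be split across two reads.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                  # undef record separator => slurp mode
    my $data = <$fh>;
    close $fh;
    return $data;
}

# Made-up sample shaped like the data in the question.
my $sample = "FieldName=\"{Sum_ttx.\r\nE\r\nv\r\n}\"\n<ObjectName>Field2</ObjectName>\n";
print strip_crlf_pairs($sample);
```

From the shell, `perl -0777 -pe 's/\r\n//g' foo.xml > fooOK.xml` should have the same effect, since `-0777` puts perl into slurp mode without needing a BEGIN block.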
Re: Scrubbing XML by cdarke (Prior) on Apr 18, 2011 at 16:00 UTC
"\n" is usually ignored in regular expressions, unless the /s flag is set at the end (as you would the /g flag). However in this case I would have thought that `s/\r//g` would be sufficient. Update: although maybe `s/\r\n?//s;` is needed here (I'm having difficulty in reproducing the data here).
Re^2: Scrubbing XML by Your Mother (Archbishop) on Apr 18, 2011 at 16:41 UTC
`\n` itself isn't ignored; what you're thinking of is `.`, which doesn't match `\n` without the `s` flag.
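A small sketch of that distinction, using a made-up two-line string: a literal `\n` in the pattern matches with or without `/s`, while `.` only crosses a newline when `/s` is set.

```perl
use strict;
use warnings;

my $text = "a\nb";

# A literal \n in the pattern matches a newline regardless of /s.
my $literal_plain = ( $text =~ /a\nb/ )  ? 1 : 0;
my $literal_s     = ( $text =~ /a\nb/s ) ? 1 : 0;

# The dot refuses to match a newline unless /s is given.
my $dot_plain = ( $text =~ /a.b/ )  ? 1 : 0;
my $dot_s     = ( $text =~ /a.b/s ) ? 1 : 0;

print "literal \\n, no /s:   $literal_plain\n";   # 1
print "literal \\n, with /s: $literal_s\n";       # 1
print "dot, no /s:          $dot_plain\n";        # 0
print "dot, with /s:        $dot_s\n";            # 1
```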
Re: Scrubbing XML by halfcountplus (Hermit) on Apr 18, 2011 at 16:00 UTC
If you can do without any newlines at all: `s/\s+/ /g;` "\s" includes \r and \n.
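A quick sketch of what that substitution does to data shaped like the sample in the question (the string and the helper name `collapse_ws` are made up for illustration). Note that each CR+LF run becomes a single space, so an attribute value split one character per line comes out space-separated rather than rejoined.

```perl
use strict;
use warnings;

# Hypothetical helper: squeeze every run of whitespace
# (spaces, tabs, CR, LF) down to a single space.
sub collapse_ws {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    return $text;
}

my $sample = "FieldName=\"{Sum_ttx.\r\nE\r\nv\r\n}\"\n<ObjectName>Field2</ObjectName>\n";

# Square brackets make the trailing space visible.
print "[", collapse_ws($sample), "]\n";
# prints: [FieldName="{Sum_ttx. E v }" <ObjectName>Field2</ObjectName> ]
```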
Re: Scrubbing XML by Anonymous Monk on Apr 18, 2011 at 18:47 UTC
Surely you can slurp the file into memory, fix the line-ending problem however you need to, and then process the resulting data normally?
Re^2: Scrubbing XML by locked_user sundialsvc4 (Abbot) on Apr 18, 2011 at 19:06 UTC
`/me nods...` Does the XML file format even care, at all, about newlines? My vague recollection is that it does not. Why can’t you just read the file, use an `s///g...` regex to stomp them all out, and process the resulting string? (For most computers these days, processing several megabytes “as a string” is no big deal anymore.)
Re^3: Scrubbing XML by ikegami (Patriarch) on Apr 18, 2011 at 19:17 UTC
"Does the XML file format even care, at all, about newlines?"
It depends what you mean. To XML, newlines and carriage returns do not have special meaning. Stripping them from the document will not affect the validity of the document. On the other hand, it will change the values of text nodes and attribute nodes. (Upd: Not completely true: See Re^7 ) That may or may not be desirable.
The OP indicated he only wanted to remove <CR><LF> pairs and leave lone <LF> behind, which can be done using your technique.
Update: Replaced bad example with better explanation.
Re^4: Scrubbing XML by Your Mother (Archbishop) on Jun 02, 2011 at 13:15 UTC
Re^5: Scrubbing XML by ikegami (Patriarch) on Jun 02, 2011 at 15:59 UTC
Re^3: Scrubbing XML by the.duck (Novice) on Apr 18, 2011 at 19:48 UTC
I don't think it does, and I've started to work with it that way. Makes it a tad hard to read (for verifying my program), but once I've coded it, what do I care? Thank you for your response! Jen, if your computer was a person I'd shoot it in the face.
Re^4: Scrubbing XML by ikegami (Patriarch) on Apr 18, 2011 at 19:58 UTC