
Hi all,

Thanks in advance for looking.

I have some CSV log files that contain embedded newlines ("\n" characters inside double-quoted fields). When I loop through them with a typical:

while(<$FILE>)

The loop sees the embedded newlines as "real" newlines and breaks each CSV record into pieces. I am running the script on a Linux (RHEL) machine, in case the platform's newline convention matters.

I did a little research and settled (somewhat unwillingly, since I can't quite parse it) on the following one-liner to remove the embedded newlines, and it worked.

perl -F'' -0 -ane 'map {$_ eq q(") && {$seen=$seen?0:1}; $seen && $_ eq "\n" &&{$_=" "}; print} @F' filename.csv > filename.csv.tmp

Some of you probably already know where I'm headed with this: it chokes badly on larger files, throwing "Out of Memory!" errors (the machine I'm running it on only has 4GB of memory).

So, getting down to it: I've been trying to turn the one-liner into a program that reads the file line by line and removes the embedded newlines, but I keep hitting the very problem I'm trying to fix, namely that <> cannot distinguish between the embedded newlines and the "real" ones.

Has anyone run into this issue before? Is it possible to look at the file in chunks and remove the embedded ones rather than searching the entire thing at once?
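For what it's worth, here is a minimal sketch of the chunked idea as I understand it. It is not a definitive solution. It assumes every embedded newline sits inside a double-quoted field and that literal quotes inside a field are doubled (""), as in standard CSV. The idea is to read one physical line at a time and keep a running quote parity: while the number of quotes seen so far is odd, the current newline is inside a quoted field and gets replaced by a space. The sub name join_embedded_newlines is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, replacing newlines that fall
# inside double-quoted fields with a single space. Only one physical
# line is held in memory at a time.
sub join_embedded_newlines {
    my ($in, $out) = @_;
    my $in_quotes = 0;    # true while a quoted field is still open

    while (my $line = <$in>) {
        # Count the double quotes on this physical line.
        my $quotes = ($line =~ tr/"//);

        # An odd count flips the state; doubled quotes ("") add two
        # and therefore leave it unchanged.
        $in_quotes ^= $quotes & 1;

        # Still inside a quoted field: this newline is embedded.
        $line =~ s/\n\z/ / if $in_quotes;

        print {$out} $line;
    }
    return;
}

# Assumed usage: perl script.pl filename.csv > filename.csv.tmp
if (!caller && @ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    join_embedded_newlines($in, \*STDOUT);
    close $in;
}
```

Like the one-liner, this turns each embedded newline into a space, but it never loads the whole file. If the quoting in the real files is irregular (for example, a stray unbalanced quote), the parity trick will misfire, and a proper CSV parser would be the safer route.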

Thanks again for looking!

In reply to Embedded Newlines or Converting One-Liner to Loop by mwb613
