mwb613 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

Thanks in advance for looking.

I have some csv log files that come with embedded newlines ('\n's within double quotes). When I loop through them with a typical:

 while(<$FILE>)

The loop sees the embedded newlines as "real" newlines and breaks my CSV lines up into pieces. I am running the script on a Linux (RHEL) machine, fwiw, as far as filesystem newlines are concerned.

I did a little bit of research and settled (somewhat unwillingly, since I can't quite parse it) on the following one-liner to remove the embedded newlines, and it worked.

perl -F'' -0 -ane 'map {$_ eq q(") && {$seen=$seen?0:1}; $seen && $_ eq "\n" &&{$_=" "}; print} @F' filename.csv > filename.csv.tmp

Some of you probably already know where I'm headed with this, but basically it chokes badly on larger files, throwing "Out of Memory!" errors (the machine I'm running it on only has 4GB of memory).
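
After staring at it some more, I think I can see both what the one-liner does and why it dies: -0 with no argument sets the record separator to the NUL character, so the whole file is slurped as one giant "line", and -a with -F'' then splits that line into a list of individual characters -- one scalar per byte, which is what exhausts the memory. Expanded into a readable script, it's roughly this (an untested sketch; $FILE is the open handle):

local $/ = "\0";   # what -0 does: NUL record separator, so <$FILE> slurps everything
my $seen = 0;      # are we currently inside a quoted field?
for my $char ( split //, <$FILE> ) {   # what -a with -F'' does: one list element per character
    $seen = $seen ? 0 : 1 if $char eq '"';    # toggle at every double quote
    $char = ' ' if $seen and $char eq "\n";   # blank out newlines inside quotes
    print $char;
}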

So, getting down to it, I've been trying to turn the one-liner into a program that will read the file line by line and remove the embedded newlines but I'm running into the primary reason I'm trying to remove them -- that <> cannot distinguish between the embedded newlines and the "real" ones.

Has anyone run into this issue before? Is it possible to look at the file in chunks and remove the embedded ones rather than searching the entire thing at once?

Thanks again for looking!

Re: Embedded Newlines or Converting One-Liner to Loop
by Cristoforo (Curate) on Dec 15, 2016 at 00:58 UTC

      Right... I'm just seeing the old thread (from 2003) now

      I think I ignored it before because my log processor doesn't just work with CSVs but I think I can probably work around that.

      Thanks!

        Hi mwb613,

        I second Text::CSV as "the" solution for reading CSV files, especially if they've got things like embedded newlines. It would also help in the case that the input format changes, since Text::CSV is quite flexible.

        my log processor doesn't just work with CSVs but I think I can probably work around that

        One solution would be to abstract out the handling of records, so that it doesn't matter where they come from:

use Text::CSV;   # don't forget to load the module

my $csv = Text::CSV->new({ binary => 1, eol => $/ })
    or die Text::CSV->error_diag;
open my $fh, "<", $file or die "$file: $!";
while ( my $row = $csv->getline($fh) ) {
    handle_row($row);
}
$csv->eof or $csv->error_diag();
close $fh;

sub handle_row {
    my ($row) = @_;   # one record, as an array ref of fields
    # ...
}

        (Untested, mostly a copy-n-paste from Text::CSV.)

        Hope this helps,
        -- Hauke D

Re: Embedded Newlines or Converting One-Liner to Loop
by LanX (Saint) on Dec 15, 2016 at 04:01 UTC
    If your only intention is to identify "real" lines, you could get them by joining "fake" lines.

    All lines are "fake" as long as the running count of double quotes seen so far is odd.

    This works as long as double quotes are escaped by doubling them (the usual CSV convention), since a doubled quote adds two to the count and leaves the parity unchanged.
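
    An untested sketch of the idea, assuming $CSVFILE and $OUTFILE are already-open read and write handles:

my $buffer = '';
while ( my $line = <$CSVFILE> ) {
    $buffer .= $line;
    next if ( $buffer =~ tr/"// ) % 2;   # odd quote count so far: still inside a field
    chomp $buffer;
    $buffer =~ tr/\n/ /;                 # flatten the embedded newlines
    print $OUTFILE "$buffer\n";
    $buffer = '';
}
print $OUTFILE $buffer if length $buffer;   # unterminated trailing record, if any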

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Thanks Rolf!

      I believe I was able to make your idea work. I've posted a solution below, feel free to critique if you have the notion.

my $wait_for_odd_quotes = 0;
my $line_accumulator    = '';
while (<$CSVFILE>) {
    chomp(my $this_line = $_);
    my @matches = $this_line =~ /(\")/g;
    my $count   = @matches;
    if ($wait_for_odd_quotes == 0) {
        if ($count % 2 == 1) {
            $line_accumulator    = $this_line;  # Reset accumulator
            $wait_for_odd_quotes = 1;           # Prime next loop to look for end of quotes
        }
        else {
            # We are not looking for an end quote and this line doesn't have
            # an odd number of quotes, so we'll write it to file.
            print $OUTFILE $this_line . "\n";
        }
    }
    else {
        if ($count % 2 == 1) {
            $line_accumulator .= ' ' . $this_line;  # Matched our open quote, taking this last bit
            $wait_for_odd_quotes = 0;               # Reset so next loop knows we're not looking to close
            print $OUTFILE $line_accumulator . "\n";
            $line_accumulator = '';                 # Clear, or the catch-all print below repeats it
        }
        else {
            $line_accumulator .= ' ' . $this_line;
        }
    }
}
# Catch the final record if the file ended inside quotes
print $OUTFILE $line_accumulator . "\n" if length $line_accumulator;
        Another implementation based on Rolf's suggestion:
#!/usr/bin/perl
use strict;
use warnings;

while (my $line = get_CSVline()) {
    print "$line\n";
}

sub get_CSVline {
    my $buffer;
    while ( !defined($buffer) or is_odd_quotes($buffer) ) {
        my $temp = <DATA>;
        last unless defined $temp;   # EOF: avoids warnings/looping on a truncated record
        $buffer .= $temp;
    }
    return unless defined $buffer;
    chomp $buffer;
    $buffer =~ s/\n/\\n/g;   #### make "\n" "visible" ###
    return $buffer;
}

sub is_even_quotes {
    my $string = shift;
    return !( ($string =~ tr/"//) % 2 );
}

sub is_odd_quotes {
    my $string = shift;
    return ( ($string =~ tr/"//) % 2 );
}

=PRINTS:
1,2,3.3,"\n",4,5
6,7,8,9
6,"\n",7,8
a,b,c,"something\nmore"
1,2,3
1,"x\n","y","z\n",3
"3.5 "" disks"
=cut

__DATA__
1,2,3.3,"
",4,5
6,7,8,9
6,"
",7,8
a,b,c,"something
more"
1,2,3
1,"x
","y","z
",3
"3.5 "" disks"
        Update: Added one more test case, '"3.5 "" disks"', to exercise doubled (escaped) quotes.

        Regarding LanX's suggestion, see also this earlier node: Remove Tabs and Newlines Inside Fields of Text Tab Delimited Files from Excel.

        Untested for comma delimiters, but presumably you would just have to change all the split and join lines to use commas instead of tabs.

        Also, from another post (Do I have to trick Split?): using "-1" for the third split parameter fixed some warnings that trailing empty data columns caused during the subsequent join.
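
        A quick demo of the difference:

# split's default drops trailing empty fields; a negative LIMIT keeps them all
my @dropped = split /,/, 'a,b,,';        # ('a', 'b')
my @kept    = split /,/, 'a,b,,', -1;    # ('a', 'b', '', '')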

my @data = resolve_comma_delimited_file_line($CSVFILE, $this_line);

# This subroutine accepts a filehandle and a line read from that filehandle
# as arguments, given in that order.
# If necessary it will modify the line that was passed to it (as if passed
# by reference) to resolve it, and return an array of the split data.
sub resolve_comma_delimited_file_line {
    my $fh = $_[0];
    chomp($_[1]);  # $_[1] being the read line passed in to this subroutine
                   # that is to be modified if necessary (as if passed by reference)
    my @data = split /,/, $_[1], -1;
    my $last_index = $#data;
    for (my $field_index = 0; $field_index < $last_index; $field_index++) {
        if (($data[$field_index] =~ tr/"//) % 2 == 1) {
            splice @data, $field_index, 2,
                "$data[$field_index] $data[$field_index+1]";
            $_[1] = join ",", @data;
            $last_index--;
            $field_index--;
        }
    }
    if (($data[$last_index] =~ tr/"//) % 2 == 1) {
        $_[1] .= " " . <$fh>;
        @data = &resolve_comma_delimited_file_line;
    }
    return @data;
}

        UPDATE: Deleted comments related to tabs since they wouldn't be relevant for the comma case.

        That being said, Text::CSV is still probably a better option, because it saves you from having to "reinvent the wheel", so to speak, and is probably a much more robust solution for the types of hiccups you may encounter in your file data.

        Just another Perl hooker - will code for food