rendier has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have some csv files where the line breaks got lost, so the file comes as one long line.

For example (this one has 6 columns):

"Business Date","Location Name","Revenue Center Name","Tender Count"," +Tender Name","Tender Total" 2007-05-14 00:00:00.0,"Aville","x",300,"b +",6899 2007-05-14 00:00:00.0,"Aville","x",6,"c",198.50 2007-05-14 00: +00:00.0,"Aville","b",290,"Cash",12336.10 2007-05-14 00:00:00.0,"Bvill +e","c",14,"d",958.40

I want to split this file into lines, and then I can process the lines using the normal csv handling stuff (split or Text::CSV).

The question is - what is the slick Perl Way to split this into lines? Some files have 6 columns, others have 10 or 14 .

Please help
reinier

Replies are listed 'Best First'.
Re: splitting cvs file without line breaks
by shmem (Chancellor) on May 15, 2007 at 16:33 UTC

    Well, based on the sample snippet, there's was a line break just before each date, right? You could try

    perl -pi.bak -e 's/ *(?:\d{4}-\d{2}-\d{2})/\r\n/g' filename

    That inserts CRLF before the date at all instances and saves the original to filename.bak - see perlrun.

    update: Oops, typo. I mean

    perl -pi.bak -e 's/ *(?=\d{4}-\d{2}-\d{2})/\r\n/g' filename

    Thanks, johngg!

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      s/ *(?:\d{4}-\d{2}-\d{2})/\r\n/g

      Perhaps I'm missing something but doesn't that replace the date rather than inserting before the date? Perhaps make the date a capture and replace with \r\n$1?

      Cheers,

      JohnGG

      using / * instead of / + might result in too many linefeeds being inserted.

      And why do you use \r\n insteat of just \n


      s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
      +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
        using / * instead of / + might result in too many linefeeds being inserted.

        Agreed.

        And why do you use \r\n insteat of just \n

        Well, that's because... well, er... :-)
        You're right again, probably inserting $/ makes more sense. But I've seen csv files that use "\r\n" as record separator, and "\n" as newlines inside fields.

        --shmem

        _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                      /\_¯/(q    /
        ----------------------------  \__(m.====·.(_("always off the crowd"))."·
        ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: splitting cvs file without line breaks
by BrowserUk (Patriarch) on May 15, 2007 at 20:11 UTC

    This should work for a variable number of fields provided that:

    1. Quoted fields don't contain quotes.
    2. The last field in each record is either quoted, or does not contain spaces.
    #! perl -slw use strict; my $data = do{ local $/; <DATA> }; my $noOfFieldsMinus1 = 5; my @records = $data =~ m[ ( (?: (?: "[^"]+" ##" | [^,]+ ) , ){$noOfFieldsMinus1} (?: "[^"]+" ##" | \S+ ) ) \s+ ]gx; print for @records; __DATA__ "Business Date","Location Name","Revenue Center Name","Tender Count"," +Tender Name","Tender Total" 2007-05-14 00:00:00.0,"Aville","x",300,"b +",6899 2007-05-14 00:00:00.0,"Aville","x",6,"c",198.50 2007-05-14 00: +00:00.0,"Aville","b",290,"Cash",12336.10 2007-05-14 00:00:00.0,"Bvill +e","c",14,"d",958.40

    Produces:

    C:\test>615576 "Business Date","Location Name","Revenue Center Name","Tender Count"," +Tender Name","Tender Total" 2007-05-14 00:00:00.0,"Aville","x",300,"b",6899 2007-05-14 00:00:00.0,"Aville","x",6,"c",198.50 2007-05-14 00:00:00.0,"Aville","b",290,"Cash",12336.10 2007-05-14 00:00:00.0,"Bville","c",14,"d",958.40

    Subsequent treatment of the csvs by any of the usual mechanisms.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hey, great!! This is exactly what I was looking for!! I'll read this slowly with perlre next to me, and then stick it in ;-) Thanks a lot
Re: splitting cvs file without line breaks
by kyle (Abbot) on May 15, 2007 at 16:37 UTC

    If I weren't concerned about a comma appearing inside a quoted field, I might do this:

    my $columns = 6; # or 10 or 14 local $/ = ','; my @fields = (); while (<>) { push @fields, $_; if ( scalar @fields == $columns ) { print @fields; @fields = (); } } print @fields;

    If I do think that a field might have a comma in it, I think I'd try to use Text::CSV.

    use Text::CSV; my $wholefile = do { local $/; <> }; my $parser = Text::CSV->new(); my $status = $parser->parse( $wholefile ); my @fields = $parser->fields(); my $columns = 6; # or 10 or 14 while ( @fields ) { my @line_fields = splice @fields, 0, $columns; my $csv = Text::CSV->new(); my $status = $csv->combine( @line_fields ); print $csv->string(); }

    I'm not really all that familiar with Text::CSV, so I wrote that based on the synopsis. Take it with a grain of salt.

    Update: Since writing this, I notice that shmem saw something I missed, namely that the last column of the original file naturally does not have a comma between it and the first column of the original file. Neither of my solutions work in that case. The file would have to be repaired first, as suggested.

      thanks all for your input. I was thinking of using a regex based on matching x comma separated fields, and cutting it of the rest of the line, so that I would not have to hard code it based on a date or any other type of content.

      something like

      $file =~ m/((\".*\"|\d+|[\d :]+),){6} (.*)/; $line = $1; $restoffile = $2;

      I'll see if this works, otherwise I'll punch up one of these suggestions.

      rendier

Re: splitting cvs file without line breaks
by Limbic~Region (Chancellor) on May 15, 2007 at 16:30 UTC
    rendier,
    Why can't you still use one of the CSV modules? My experience with Text::CSV for instance tells me you can read the one long line into an @fields array and then process @fields in groups of N. Of course, you will want to write the file back out correctly so that you only have to go through this exercise once.

    Cheers - L~R

      Hmm... based on the sample input
      "Tender Total" 2007-05-14 00:00:00.0

      would make up just one field? And that's where the line break was..

      --shmem

      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
        shmem,
        You are correct and I was wrong. I had incorrectly assumed that the line endings had been replaced with the CSV delimeter. The process I outlined could still work but it would be more cumbersome (splitting Nth fields).

        Cheers - L~R