splitting cvs file without line breaks

rendier has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: splitting cvs file without line breaks by shmem (Chancellor) on May 15, 2007 at 16:33 UTC
Well, based on the sample snippet, there's was a line break just before each date, right? You could try `perl -pi.bak -e 's/ (?:\d{4}-\d{2}-\d{2})/\r\n/g' filename` [download] That inserts CRLF before the date at all instances and saves the original to filename.bak - see perlrun. update:* Oops, typo. I mean `perl -pi.bak -e 's/ *(?=\d{4}-\d{2}-\d{2})/\r\n/g' filename` [download] Thanks, johngg! --shmem _($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}	[reply] [d/l] [select]
Re^2: splitting cvs file without line breaks by johngg (Canon) on May 15, 2007 at 19:17 UTC
`s/ *(?:\d{4}-\d{2}-\d{2})/\r\n/g` Perhaps I'm missing something but doesn't that replace the date rather than inserting before the date? Perhaps make the date a capture and replace with `\r\n$1`? Cheers, JohnGG	[reply] [d/l] [select]
Re^2: splitting cvs file without line breaks by Skeeve (Parson) on May 15, 2007 at 20:28 UTC
using `/ ` instead of `/ +` might result in too many linefeeds being inserted. And why do you use \r\n insteat of just \n `s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%` `+`.+=%;.#_}\&"^"-+%).}%:##%}={~=~:.")&e&&s""`$''`"e	[reply] [d/l] [select]
Re^3: splitting cvs file without line breaks by shmem (Chancellor) on May 15, 2007 at 21:11 UTC
using / * instead of / + might result in too many linefeeds being inserted. Agreed. And why do you use \r\n insteat of just \n Well, that's because... well, er... :-) You're right again, probably inserting `$/` makes more sense. But I've seen csv files that use "\r\n" as record separator, and "\n" as newlines inside fields. --shmem _($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}	[reply] [d/l]
Re: splitting cvs file without line breaks by BrowserUk (Patriarch) on May 15, 2007 at 20:11 UTC
This should work for a variable number of fields provided that: Quoted fields don't contain quotes. The last field in each record is either quoted, or does not contain spaces. #! perl -slw use strict; my $data = do{ local $/; <DATA> }; my $noOfFieldsMinus1 = 5; my @records = $data =~ m[ ( (?: (?: "[^"]+" ##" \| [^,]+ ) , ){$noOfFieldsMinus1} (?: "[^"]+" ##" \| \S+ ) ) \s+ ]gx; print for @records; __DATA__ "Business Date","Location Name","Revenue Center Name","Tender Count"," +Tender Name","Tender Total" 2007-05-14 00:00:00.0,"Aville","x",300,"b +",6899 2007-05-14 00:00:00.0,"Aville","x",6,"c",198.50 2007-05-14 00: +00:00.0,"Aville","b",290,"Cash",12336.10 2007-05-14 00:00:00.0,"Bvill +e","c",14,"d",958.40 [download] Produces: `C:\test>615576 "Business Date","Location Name","Revenue Center Name","Tender Count"," +Tender Name","Tender Total" 2007-05-14 00:00:00.0,"Aville","x",300,"b",6899 2007-05-14 00:00:00.0,"Aville","x",6,"c",198.50 2007-05-14 00:00:00.0,"Aville","b",290,"Cash",12336.10 2007-05-14 00:00:00.0,"Bville","c",14,"d",958.40` [download] Subsequent treatment of the csvs by any of the usual mechanisms. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^2: splitting cvs file without line breaks by rendier (Initiate) on May 15, 2007 at 21:06 UTC
Hey, great!! This is exactly what I was looking for!! I'll read this slowly with perlre next to me, and then stick it in ;-) Thanks a lot	[reply]
Re: splitting cvs file without line breaks by kyle (Abbot) on May 15, 2007 at 16:37 UTC
If I weren't concerned about a comma appearing inside a quoted field, I might do this: `my $columns = 6; # or 10 or 14 local $/ = ','; my @fields = (); while (<>) { push @fields, $_; if ( scalar @fields == $columns ) { print @fields; @fields = (); } } print @fields;` [download] If I do think that a field might have a comma in it, I think I'd try to use Text::CSV. `use Text::CSV; my $wholefile = do { local $/; <> }; my $parser = Text::CSV->new(); my $status = $parser->parse( $wholefile ); my @fields = $parser->fields(); my $columns = 6; # or 10 or 14 while ( @fields ) { my @line_fields = splice @fields, 0, $columns; my $csv = Text::CSV->new(); my $status = $csv->combine( @line_fields ); print $csv->string(); }` [download] I'm not really all that familiar with Text::CSV, so I wrote that based on the synopsis. Take it with a grain of salt. Update: Since writing this, I notice that shmem saw something I missed, namely that the last column of the original file naturally does not have a comma between it and the first column of the original file. Neither of my solutions work in that case. The file would have to be repaired first, as suggested.	[reply] [d/l] [select]
Re^2: splitting cvs file without line breaks by rendier (Initiate) on May 15, 2007 at 18:56 UTC
thanks all for your input. I was thinking of using a regex based on matching x comma separated fields, and cutting it of the rest of the line, so that I would not have to hard code it based on a date or any other type of content. something like `$file =~ m/((\".\"\|\d+\|[\d :]+),){6} (.)/; $line = $1; $restoffile = $2;` [download] I'll see if this works, otherwise I'll punch up one of these suggestions. rendier	[reply] [d/l]
Re: splitting cvs file without line breaks by Limbic~Region (Chancellor) on May 15, 2007 at 16:30 UTC
rendier, Why can't you still use one of the CSV modules? My experience with Text::CSV for instance tells me you can read the one long line into an @fields array and then process @fields in groups of N. Of course, you will want to write the file back out correctly so that you only have to go through this exercise once. Cheers - L~R	[reply]
Re^2: splitting cvs file without line breaks by shmem (Chancellor) on May 15, 2007 at 16:35 UTC
Hmm... based on the sample input "Tender Total" 2007-05-14 00:00:00.0 would make up just one field? And that's where the line break was.. --shmem _($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}	[reply]
Re^3: splitting cvs file without line breaks by Limbic~Region (Chancellor) on May 15, 2007 at 17:09 UTC
shmem, You are correct and I was wrong. I had incorrectly assumed that the line endings had been replaced with the CSV delimeter. The process I outlined could still work but it would be more cumbersome (splitting Nth fields). Cheers - L~R	[reply]