Re: swapping PIPE for comma in CSV file
by Tux (Canon) on Jun 26, 2007 at 13:31 UTC
|
You could use Text::CSV_XS with two objects, and do it safe (untested):
use strict;
use warnings;
use IO::Handle;
use Text::CSV_XS;
my $csv_in = Text::CSV_XS->new ({
binary => 1,
});
my $csv_out = Text::CSV_XS->new ({
binary => 1,
sep_char => "|",
escape_char => "\\",
eol => "\n",
});
while (my $row = $csv_in->getline (*ARGV)) {
$csv_out->print (*STDOUT, $row);
}
Enjoy, Have FUN! H.Merijn
| [reply] [d/l] |
|
|
For the OP just to emphasize the importance of binary in the case you have a def. of CSV that permits embedded newlines.
For Merijn.
I was going to answer more or less the same to the OT yesterday, but came across a few problems, that made me reinstall the latest versions...
- One problem was that I used IO::Wrap objects for stdin and stdout and they don't work with the pure perl version, I am not sure why. Maybe it would be better to load IO::Handle directly and have something for those who want efficiency. In this thread I wanted to test the pure perl version as installing an XS module could have been problematic for thew OP. I think that keeping in sync both versions is important...
- for some reason search.cpan.org gives the version 0.29 Text::CSV_XS but
perl -MCPAN -e install qw(Text::CSV_XS)' installs 0.30 the right one I believe if I remember your post on p5p or pm.
% steph@apexPDell2 (/home/stephan) %
% cat conv_comma2pipe_xs.px
#!/usr/bin/perl
use strict;
use warnings;
$|++;
#use IO::Handle;
use IO::Wrap;
use Text::CSV_XS;
# use DDS;
# my $in = IO::Wrap::wraphandle(\*STDIN) or die;
# my $out = IO::Wrap::wraphandle(\*STDOUT) or die;
# Dump\($in, $out);
my $csv_in = Text::CSV_XS->new({
binary => 1,
}) or die;
my $csv_out = Text::CSV_XS->new({
binary => 1,
sep_char => q{|},
eol => qq{\n},
}) or die;
while (defined (my $rec = $csv_in->getline(\*STDIN)) ) {
{ my @fields = @$rec;
local $"=q{][}; print {\*STDERR} ".rec [@fields]\n";
}
$csv_out->print(\*STDOUT, $rec);
}
__END__
% steph@apexPDell2 (/home/stephan) %
% cat hi1.csv | perl+ -w conv_comma2pipe_xs.px
.rec [a][b][c]
a|b|c
.rec [a][okay, comma][c]
a|"okay, comma"|c
.rec [a][long
line, indeed][end]
a|"long
line, indeed"|end
% steph@apexPDell2 (/home/stephan) %
% cat hi1.csv
a,b,c
a,"okay, comma",c
a,"long
line, indeed",end
cheers
--stephan p.s I tested on cygwin with perl 5.8.7 and 5.8.8
update: oops forgot the code...
| [reply] [d/l] |
|
|
One problem was that I used IO::Wrap objects for stdin and stdout and they don't work with the pure perl version, I am not sure why. Maybe it would be better to load IO::Handle directly and have something for those who want efficiency. In this thread I wanted to test the pure perl version as installing an XS module could have been problematic for thew OP. I think that keeping in sync both versions is important...
The maintainer of Text::CSV_PP is doing a real nice job in trying to keep it in sync with Text::CSV_XS and we do have (a lot) of contact about that. I already had a look at version 1.06, and it passed all tests for 0.30, except the diagnostics tests, which is logical and explainable.
That maintainer also got the maintainership for the very old Text::CSV, which will be a wrapper around Text::CSV_XS and Text::CSV_PP and choose the one available, based on a method used in DBI::PurePerl: the environment variable TEXT_CSV_XS, and will default to the fastest method available.
I have been thinking about the use of IO::Handle, making it either default, or use'd automatically, but everything I came up with so far will imply a slowdown, which is IMHO unacceptable. I't a bit of a shame that this is a relative expensive module to load (14 kb of source code).
for some reason search.cpan.org gives the version 0.29 Text::CSV_XS but perl -MCPAN -e install qw(Text::CSV_XS)' installs 0.30 the right one I believe if I remember your post on p5p or pm.
Maybe I've been working too hard lately on this module, and uploaded too many versions :) Give CPAN some time to sync around the world.
I'll have a look at the IO::Wrap thingy
Enjoy, Have FUN! H.Merijn
| [reply] [d/l] [select] |
Re: swapping PIPE for comma in CSV file
by tirwhan (Abbot) on Jun 26, 2007 at 13:45 UTC
|
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new();
while (<DATA>) {
my $line = $csv->parse($_);
print join("|",$csv->fields)."\n";
}
__DATA__
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3
+C9786A","05E566DF","J0180794.JPG",32768,3290,"WIN",""
"000005EE5E3F6961B78CE4549270DE5D05CBC0CB","8D025B6AE1994A40FCBB5AEC2E
+F273F9","5E8D7D42","WabIab, and more.bor",4760,4616,"WIN",""
"0000085FC602CD8AD4793A874A47D286DACB0F6A","8BA8BC04896C421A704282E9B8
+7B5520","8D89A85D","fpSDt,Finder,Link.gif",1161,2988,"Solaris",""
"00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC
+7A46FB","2D729A1E","cmnres,pdb.dll",76800,1550,"WIN",""
OTOH, why do this at all? Commas within quotation marks are common enough in CSV files and (as you've found out) are easy enough to handle.
| [reply] [d/l] |
|
|
Nice. I'll have to dig into CSV_XS.
Yeah, I can handle it as is, but I'm getting asked to convert it for others who can't. It's due to a legacy spec that wasn't the greatest, but whatcha gonna do? (I got them to change the spec to put the three static width columns first, and paid for that with a lot of blood - not gonna ask for more)
| [reply] |
Re: swapping PIPE for comma in CSV file
by pseudomonas (Monk) on Jun 26, 2007 at 13:24 UTC
|
If your data is always going to have that number of fields, you can just work from the front and the back and comma-join all the bits in the middle.
perl -F, -lane'chomp;print join("|",@F[0..2], join(",", @F[3..$#F-4]),@F[-3..-1])' A/NSRLFile.txt | [reply] [d/l] |
|
|
Yeah - that's what I do now, in a VERY long-winded way. I didn't know about @F . Cool!
| [reply] |
Re: swapping PIPE for comma in CSV file
by johngg (Canon) on Jun 26, 2007 at 14:59 UTC
|
Using split with a limit along with reverse seems to be a good way to isolate the fields in this particular case. However, I would agree with some other Monks recommendations to use a module in most circumstances. I have cut down the length of the fields to avoid too much line wrapping.
use strict;
use warnings;
while ( <DATA> )
{
my @f012 = split m{,}, $_, 4;
my $rest = pop @f012;
my @f7654 = map { my $r = reverse } split m{,}, reverse($rest), 5;
my $f3 = pop @f7654;
print join q{|}, @f012, $f3, reverse @f7654;
}
__END__
"13F2","E3C9","05E5","J0180794.JPG",32768,3290,"WIN",""
"D05C","2EF2","5E8D","WabIab, and more.bor",4760,4616,"WIN",""
"6DAC","B87B","8D89","fpSDt,Finder,Link.gif",1161,2988,"Solaris",""
"4DE1","BC7A","2D72","cmnres,pdb.dll",76800,1550,"WIN",""
The output.
"13F2"|"E3C9"|"05E5"|"J0180794.JPG"|32768|3290|"WIN"|""
"D05C"|"2EF2"|"5E8D"|"WabIab, and more.bor"|4760|4616|"WIN"|""
"6DAC"|"B87B"|"8D89"|"fpSDt,Finder,Link.gif"|1161|2988|"Solaris"|""
"4DE1"|"BC7A"|"2D72"|"cmnres,pdb.dll"|76800|1550|"WIN"|""
I hope this is of interest. Cheers, JohnGG
Update: There's no need for the lexical in the map, $_ will do, and there's no need to pop off $f3 as we are not doing anything with it. Code becomes
while ( <DATA> )
{
my @f012 = split m{,}, $_, 4;
my $rest = pop @f012;
my @f76543 = map { $_ = reverse } split m{,}, reverse($rest), 5;
print join q{|}, @f012, reverse @f76543;
}
Update 2: Using scalars for the first split means the pop can be avoided and incorporate all the rest in the print statement. Code becomes
while ( <DATA> )
{
my ($f0, $f1, $f2, $rest) = split m{,}, $_, 4;
print join q{|},
$f0, $f1, $f2,
reverse map { $_ = reverse } split m{,}, reverse($rest), 5;
}
| [reply] [d/l] [select] |
Re: swapping PIPE for comma in CSV file
by pseudomonas (Monk) on Jun 26, 2007 at 13:37 UTC
|
Alternatively, you can always use regexes. This is being silly, though.
perl -lane'sub f{s/^([^,]+),/$1|/} sub b{s/,([^,]+)$/|$1/} f;f;f;b;b;b;b; print' A/NSRLFile.txt
Like another commenter suggested, though, I'd use a module unless you really have a good reason not to.
| [reply] [d/l] |
Re: swapping PIPE for comma in CSV file
by Anonymous Monk on Jun 26, 2007 at 14:12 UTC
|
I generally don't have the luxury of using anything not mentioned on perlfunc. This works for me without any special functions:
#!/usr/bin/perl -w
use strict;
{
print join ('|', (/(^"[^"]*"),("[^"]*"),("[^"]*"),("[^"]*"),([0-9]
+*),([0-9]*),("[^"]*")/), "\"\"\n") while (<DATA>);
}
__DATA__
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3
+C9786A","05E566DF","J0180794.JPG",32768,3290,"WIN",""
"000005EE5E3F6961B78CE4549270DE5D05CBC0CB","8D025B6AE1994A40FCBB5AEC2E
+F273F9","5E8D7D42","WabIab, and more.bor",4760,4616,"WIN",""
"0000085FC602CD8AD4793A874A47D286DACB0F6A","8BA8BC04896C421A704282E9B8
+7B5520","8D89A85D","fpSDt,Finder,Link.gif",1161,2988,"Solaris",""
"00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC
+7A46FB","2D729A1E","cmnres,pdb.dll",76800,1550,"WIN",""
It marks everything in a field, with the fourth field (third if zero based) being the variable-length one. I'm sure you could break it down even further using only perlfuncs but it's a working solution based on the provided input.
This is my first post to perlmonks btw but I'm a longtime perl user. | [reply] [d/l] |
Re: swapping PIPE for comma in CSV file
by LighthouseJ (Sexton) on Jun 27, 2007 at 14:24 UTC
|
I posted the solution above with only using perlfunc. I just registered this login today.
I found yet another solution, which is radically different from anything else I've seen.
I got to thinking about my other comment correcting someone else that we should be concerned less with the commas and more with the quotes because they provide a more concrete "rule" for delimiting the items. I was writing an array definition for another file and noticed the definition was identical to the provided data, so I wrote a program which simply evals the data into an array, then prints it out in the right format. I'm sure somebody can write something to clean up the output and make it more elegant but the eval does a bang-up job of reading the data properly.
#!/usr/bin/perl -w
use strict;
{
while (<DATA>) {
my @i = ();
eval "\@i = ($_);";
print "\"$i[0]\"|\"$i[1]\"|\"$i[2]\"|\"$i[3]\"|$i[4]|$i[5]|\"$i[6]
+\"|\"$i[7]\"\n";
}
}
__DATA__
"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D62996392FE3
+C9786A","05E566DF","J0180794.JPG",32768,3290,"WIN",""
"000005EE5E3F6961B78CE4549270DE5D05CBC0CB","8D025B6AE1994A40FCBB5AEC2E
+F273F9","5E8D7D42","WabIab, and more.bor",4760,4616,"WIN",""
"0000085FC602CD8AD4793A874A47D286DACB0F6A","8BA8BC04896C421A704282E9B8
+7B5520","8D89A85D","fpSDt,Finder,Link.gif",1161,2988,"Solaris",""
"00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC
+7A46FB","2D729A1E","cmnres,pdb.dll",76800,1550,"WIN",""
Hopefully this is useful to somebody.
"The three principal virtues of a programmer are Laziness, Impatience, and Hubris. See the Camel Book for why." -- `man perl`
| [reply] [d/l] |
|
|
perl -wlpe
"s/(\042[^\042]*?\042)/'$1'/g; $_ = join '|', eval qq/($_)/"
NSRLFile.csv > NSRLfile.psv
see previous post for definitions of NSRLFile.* files.
\042 is a " char. | [reply] [d/l] |
Re: swapping PIPE for comma in CSV file
by jwkrahn (Abbot) on Jun 26, 2007 at 14:14 UTC
|
$ echo '"00000142988AFA836117B1B572FAE4713F200567","9B3702B0E788C6D629
+96392FE3C9786A","05E566DF","J0180794.JPG",32768,3290,"WIN",""
"000005EE5E3F6961B78CE4549270DE5D05CBC0CB","8D025B6AE1994A40FCBB5AEC2E
+F273F9","5E8D7D42","WabIab, and more.bor",4760,4616,"WIN",""
"0000085FC602CD8AD4793A874A47D286DACB0F6A","8BA8BC04896C421A704282E9B8
+7B5520","8D89A85D","fpSDt,Finder,Link.gif",1161,2988,"Solaris",""
"00000FF9D0ED9A6B53BC6A9364C07074DE1565F3","A5D49D6DA9D78FD1E7C32D58BC
+7A46FB","2D729A1E","cmnres,pdb.dll",76800,1550,"WIN",""' | \
perl -nle'print join q/"|"/, split /","/'
"00000142988AFA836117B1B572FAE4713F200567"|"9B3702B0E788C6D62996392FE3
+C9786A"|"05E566DF"|"J0180794.JPG",32768,3290,"WIN"|""
"000005EE5E3F6961B78CE4549270DE5D05CBC0CB"|"8D025B6AE1994A40FCBB5AEC2E
+F273F9"|"5E8D7D42"|"WabIab, and more.bor",4760,4616,"WIN"|""
"0000085FC602CD8AD4793A874A47D286DACB0F6A"|"8BA8BC04896C421A704282E9B8
+7B5520"|"8D89A85D"|"fpSDt,Finder,Link.gif",1161,2988,"Solaris"|""
"00000FF9D0ED9A6B53BC6A9364C07074DE1565F3"|"A5D49D6DA9D78FD1E7C32D58BC
+7A46FB"|"2D729A1E"|"cmnres,pdb.dll",76800,1550,"WIN"|""
| [reply] [d/l] |
|
|
You forgot what I initially forgot, and that's the numbers that don't have quotes. I think the end one line should look like:
"|"WabIab, and more.bor"|4760|4616|"WIN"|""
instead of:
"|"WabIab, and more.bor",4760,4616,"WIN"|""
But somebody can feel free to correct me if I'm wrong. | [reply] [d/l] [select] |
Re: swapping PIPE for comma in CSV file
by Ploux (Acolyte) on Jun 26, 2007 at 15:05 UTC
|
s/([0-9"]),([0-9"])/$1\|$2/g
| [reply] [d/l] |
|
|
An embedded comma might sit between two numerals or at the beginning or end of the field next to the quotes.
| [reply] |
|
|
Somehow I knew it couldn't be that easy.
| [reply] |
Re: swapping PIPE for comma in CSV file
by girarde (Hermit) on Jun 26, 2007 at 15:15 UTC
|
| [reply] [d/l] [select] |
|
|
A few people missed this (myself included).
You have to split on a pair of double quotes or no quotes at all. There's the field with commas in it which you can't replace with a pipe so a simple regex or split/join won't work (at least without a massive regex on the split).
| [reply] |
|
|
Right, but I'm splitting on the 'quote, comma, quote' pattern, so embedded commas sail through. You can split on a pattern, although join wants a string.
| [reply] [d/l] [select] |
|
|
|
|
Re: swapping PIPE for comma in CSV file
by Anonymous Monk on Jun 26, 2007 at 20:49 UTC
|
all you need is not_quote-comma-quote, quote-comma-quote, and quote-comma-not_quote:
echo <DATA> | perl -F'(?<=[^"]),(?=")|(?<="),(?=")|(?<="),(?=[^"])' -lane 'print join("|",@F)'
though i could be wrong, i just learned about look-ahead/behind today
--jd | [reply] [d/l] |
Re: swapping PIPE for comma in CSV file
by Anonymous Monk on Jun 27, 2007 at 09:20 UTC
|
i would strongly agree with others who have recommended use of a standard,
well tested csv-parsing module.
however, if you really have to go commando (and do it on the command line
to boot), the following (tested) code should work.
note that:
-
NSRLFile.csv contains the example data from the op
-
NSRLFile.psv contains the processed, pipe character-separated data
-
the code makes no assumptions about the absence or presence whitespace
around the value-separating commas or at the beginning or end of records
-
double-quoted strings can contain anything, even backslash-escaped
double-quotes (or any other backslash-escaped character)
-
the only value supported other than a double-quoted string is a simple
numeric with NO sign, decimal point or embedded commas
-
the code makes no assumptions about the order of the fields in a record
-
had to use \042 in place of a double-quote character because otherwise,
the xp command-line interpreter gets confused and thinks the first pipe
character it sees (in a regex) is an actual command-line pipe operator
perl -wlpe "use strict; my $quo = qq(\042); my $esc = qq(\\\\);
my $body = qr/[^$quo$esc]/; my $escaped = qr/$esc ./x;
my $sep = qr/ \s* , \s* | \s* \Z /x;
my $quoted = qr( $quo $body* (?: $escaped $body* )* $quo )x;
my $number = qr(\d+);
$_ = join '|', m/( $quoted | $number ) (?= $sep )/gx"
NSRLFile.csv > NSRLfile.psv
| [reply] [d/l] |
Re: swapping PIPE for comma in CSV file
by Moron (Curate) on Jun 27, 2007 at 10:23 UTC
|
I don't see why you can't just s/\",\"/\"|\"/g - embedded commas won't mess that up - even a single "," field won't mess up because the "," match at separator preceding it will take precedence and s/ will subsequently position after the executed substition at ,",", causing the next match to occur also at the correct place. (Update: Tested it now on "fred",",","bert" and it worked). I am not going to jump on the "always use a module" bandwagon, because only the OPer can possibly know what the issues are surrounding that - I seem to have to keep saying it in here: it is unsafe to draw conclusions when potentially missing crucial information.
__________________________________________________________________________________
^M Free your mind!
| [reply] |
|
|
I don't see why you can't just s/\",\"/\"|\"/g
Perhaps you should have taken a better look at the data in the original node then, or read the several answers that detail why that won't work in this case! (Hint: there are unquoted numbers in the data). Honestly, posting an answer to a day-old thread without even reading the preceding answers seems a complete waste of everyones time.
| [reply] [d/l] |