amelinda has asked for the wisdom of the Perl Monks concerning the following question:
So, I have this CSV file of data. Some numbers, some text. The first few lines look like:
'12342248','BAR ','12JUNK ','6532918 12" JUNK ON BAR-709/8709/811 ',
+' ','A',01,'N',' ','01',000000,+000030.00,+000150.00,'I','
+ '
'13121751','FOO ','13FUNK ','6715500 FOO FUNK-331/334 ',
+' ','A',03,'N',' ','99',000000,+000134.00,+000275.00,'I','
+ '
So, really, I can just strip off the 's and extra spaces and then split on the commas, right? That's what I currently have going. All seemed to be ok... But I am horribly wrong.
Further down, there are lines like...
'18592038','BL ','17BLE ','6715500 17" BLETCH BLE-333,334 ',
+' ','A',01,'N',' ','00',000000,+000007.00,+000200.00,'R','
+ '
Notice how one of the text fields has a few entries with an embedded comma. In some, other records, one of the other text fields has an embedded '. Now what?
Unpack seems to be the answer, but the documentation in the Camel (and what I've supersearched here) is pretty sketchy. Can anyone help? I'd really appreciate explanations that include how and why, so that I might be converted to the happy use of unpack.
(tye)Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by tye (Sage) on Oct 31, 2001 at 22:47 UTC
|
@fields= unpack
"xA8x3A5x3A8x3A37x3A1x3A1x2A2x2A1x3A2x3A2x2A6x1"
. "A10x1A10x2A1x3A16", $record;
should work.
-
tye
(but my friends call me "Tye") | [reply] [Watch: Dir/Any] [d/l] |
|
That makes some sense... A seems to be for text (ascii?)... though I'm not sure what the difference tween A and a is. But what're the x's for? Single chars? Punctuation? !text?
| [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |
Re (tilly) 1: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by tilly (Archbishop) on Oct 31, 2001 at 22:52 UTC
|
| [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |
|
It could be made to though. That could even be made
configurable, though I would prefer it to default to not
for the simple reason that I don't like it.
A bigger problem is that I only consider " a quote
character. I will need to check whether ' also qualifies
for Microsoft products. If it does I should make it
handle that as well. (They never bother writing it, so
I didn't either.) If not, then I could make that
configurable as well...
| [reply] [Watch: Dir/Any] |
|
|
|
Actually, I'm more interested in how it could handle the unescaped embedded comma in one of the fields.
If only splitting (or Text::xSVing) would handle that extraneous comma, i could worry about stripping off the quote marks later. It was the combination the unescaped embedded comma, the unescaped embedded quote, and the fact that only some fields were marked off in quotes that led me down the road to unpack.
| [reply] [Watch: Dir/Any] |
|
Embedded commas aren't a problem for CSV. The whole point of putting quotes around fields in the CSV is so that embedded commas can be dealt with. Embedded quotes aren't a problem if they are escaped (by doubling them).
If you are curious how this is done, then I encourage you to download Text::CSV and/or Text::xSV (or just look at Text::xSV locally) and look for yourself. (:
-
tye
(but my friends call me "Tye")
| [reply] [Watch: Dir/Any] |
Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by amelinda (Friar) on Oct 31, 2001 at 22:44 UTC
|
Bother. I should have mentioned an additional quirk.
One reason I went with split is for the following construction:
push (@items, [split(/,/, $line)]); where @items is a LoL.
If I have to go with unpack, how should I go about pushing them into @items?
Did I mention that this is not going to be on my server, so I can't rely on the presence of modules? :sigh:
Update: go go gadget unpack. It works now.
Thanks go especially to tye. I appreciate the other answers too (I'm definitely going to look into Text::ParseWords and Text::xSV for this in the future with cleaner data), but most of them had issues with either the the embedded comma or the embedded single quote.
As for those who berated me (again) for not wanting to use modules, I submit that I looked to unpack because a) it looked like the only solution that would cope with the embedded delineators issue and b) I didn't really understand how unpack works and I knew I needed to learn about it... especially if i am ever to grok obfuscated code. :)
Update 2: To clarify, these would be unescaped commas and quotemarks. Also, I didn't provide a line of data here that had an unescaped, embedded quotemark, but I do have some. | [reply] [Watch: Dir/Any] [d/l] [select] |
Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by MZSanford (Curate) on Oct 31, 2001 at 22:44 UTC
|
You could use unpack similar to the following (as long as the fields are fixed width) :
my $data = qq!'a','b, and c','d '!;
my ($a,undef,$b,undef,$c) = unpack("a3a1a10a1a3",$data);
If the fields are not fixed width, unpack may not be the bext way to go.
i had a memory leak once, and it ruined my favorite shirt. | [reply] [Watch: Dir/Any] [d/l] |
Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by davorg (Chancellor) on Oct 31, 2001 at 23:39 UTC
|
use Text::ParseWords;
my @items;
while (<>) {
push @items, [ quotewords(',', 0, $_) ];
}
It's part of the standard Perl distribution.
--
<http://www.dave.org.uk>
"The first rule of Perl club is you don't talk about
Perl club."
| [reply] [Watch: Dir/Any] [d/l] |
|
| [reply] [Watch: Dir/Any] |
|
Hmm... well you're right that it would break
on unescaped single quotes, but looking at amelinda's
data I couldn't see any. Double quotes yes, but not single
quotes. That's why I sugggested it.
--
<http://www.dave.org.uk>
"The first rule of Perl club is you don't talk about
Perl club."
| [reply] [Watch: Dir/Any] |
|
Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by Zaxo (Archbishop) on Oct 31, 2001 at 22:55 UTC
|
chomp $line;
my @record = split /','/, $line;
substr($record[0],0,1) = '';
chop $record[-1];
I threw in a variety of functions to enjoy.
After Compline, Zaxo | [reply] [Watch: Dir/Any] [d/l] |
|
Hm. Close. I thought of that too. But that breaks on at least one field which is not enclosed with 's.
| [reply] [Watch: Dir/Any] |
Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by shotgunefx (Parson) on Nov 01, 2001 at 01:47 UTC
|
I work with CSV files a LOT.
I'd go with Text::CSV or Text::xSV personally.
I don't know where the data is coming from, but it should be getting exported correctly in the first place. Even if you hack a solution with split or unpack, it's probably only a matter of time before a change in data causes your workaround to start failing again.
-Lee
"To be civilized is to deny one's nature." | [reply] [Watch: Dir/Any] |
Re: parsing CSV file with embedded commas (fortunately, fixed-width) - is unpack the solution?
by runrig (Abbot) on Nov 01, 2001 at 05:03 UTC
|
unpack is a good answer as long as it's a Fixed Width
format file and nothing else seems to work quite right since
its not quite actual CSV. You might also want to try Parse::FixedLength, it might go something like this: use Parse::FixedLength:
# Set field names and lengths
# (you might want to have more meaningful names) :-)
my $parser = Parse::FixedLength->new([
field1=>8,
field2=>10,
...etc.
]);
my @names = @{$parser->names};
while (<>) {
my $data = $parser->parse($data);
# Strip leading and trailing quotes and spaces?
s/^'?\s+|\s+'?,$//g for values %$data;
# Output real csv, e.g.
s/"/""/g for values %$data;
print join(",", map qq("$_"), @$data{@names}),"\n";
}
| [reply] [Watch: Dir/Any] [d/l] |
|
Hi,
I have tried your new method but it's some error message here :
Can't locate object method "new" via package "Parse::FixLength" (perhaps you forgot to load "Parse::FixLength"?) at ./scal.pl line 18.
Do you know what that's mean ?
| [reply] [Watch: Dir/Any] |
|
Can't locate object method "new" via package "Parse::FixLength"
Did you install the module? If you installed it, you must have spelled it correctly once ('Parsed::FixedLength'). You just need to spell it correctly again.
| [reply] [Watch: Dir/Any] |
|
|