lorenzov has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I mentioned this problem to somebody in the chatterbox today, but had no luck, so I'm putting myself in the hands of the experts. Basically I'm trying to parse a CSV file and load the values into a database. Even before getting to the database part, however, I'm having issues with the parsing.
My original solution (I'm sure this is sub-optimal!) was based on regex matching:
#solution 1, using regex and Text::ParseWords
use Text::ParseWords;

my $file = "zs_PS0001_9_epack_2006b.csv";
open(FILE, "<$file") or die("Could not open $file: $!.");

$count = 0;
# Process the data.
while (<FILE>) {
    # Line used to debug the script taking only a few lines.
    exit if ($count >= 20);
    #print "=========================== Record $count\n";
    $count++;
    print "myCounter: ", $count, "\n";

    @words  = $_ =~ m/"[^"\r\n]*"|[^,\r\n]*/mg;
    @result = &quotewords(',', 0, @words);

    # Build the array for the database command
    #@myArray = ($result[0], $result[2], $result[4], $result[6], $result[8],
    #            $result[10], $result[12], $result[14], $result[16], $result[18]);
    my $courseID    = $result[0];
    my $sessionID   = $result[1];
    my $userID      = $result[2];
    my $PersonID    = $result[3];
    my $mydatetime  = $result[4];
    my $millisecond = $result[5];
    my $loc         = $result[6];
    my $action      = $result[7];
    my $page        = $result[8];
    my $time        = $result[9];

    print STDOUT "result: ", $courseID, "\n";
    print STDOUT "result: ", $sessionID, "\n";
    print STDOUT "result: ", $userID, "\n";
    print STDOUT "result: ", $PersonID, "\n";
    print STDOUT "result: ", $mydatetime, "\n";
    print STDOUT "result: ", $millisecond, "\n";
    print STDOUT "result: ", $loc, "\n";
    print STDOUT "result: ", $action, "\n";
    print STDOUT "result: ", $page, "\n";
    print STDOUT "result: ", $time, "\n";
    print STDOUT "***********\n";
} # end solution 1_______________________
# solution 2, using Text::CSV_XS
use Text::CSV_XS;

my $csv = Text::CSV_XS->new();
$count = 0;
my $file = 'PS0002_9_2006b.txt';
if (defined $ARGV[0]) {
    $file = $ARGV[0];
}
my $sum = 0;

open(my $data, '<', $file) or die "Could not open '$file'\n";

while (my $line = <$data>) {
    chomp $line;
    exit if ($count >= 20);
    $count++;
    if ($csv->parse($line)) {
        my @columns = $csv->fields();
        $sum += $columns[2];
    }
    else {
        warn "Line could not be parsed: $line\n";
    }
}
print "$sum\n";
# end solution 2_______________________
Now, if I run either of these on my CSV files, this is what I get (example from the first loop):
myCounter: 1
result: ■z s _ P S 0 0 0 1 _ 9 _ e p a c k
result: 9 1 6 9 4 9 8 1 1 0 2 1
result: g u e s t
result: 9 1 6 9 4 7 6 2 7 0 2 1
result: 2 0 0 6 - 1 0 - 1 1 2 3 : 1 2 : 0 0 . 0 0 0
result: 1 1 6 0 6 0 4 7 2 0 0 0 0
result: l o g o u t
result: l o g i n
result: W e b C T V i s t a s e r v e r
result: 0
***********

I've no idea why there are spaces which aren't visible when the file is opened in any application. I had a look at the encoding and attempted to convert the files; it seems that with UTF-16 the output looks like the above, but I'm losing the spaces if I re-save it as UTF-8.
The problem is that something is messing up the strings; solution 2 doesn't work at all, refusing to parse the CSV lines. Depending on the encoding of the file you can see a 'funny' box char at the start above; in UTF-8 it looks like this:
´╗┐zs_PS0001_9_epack ....(rest of the line)
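Both symptoms point the same way: UTF-16LE stores each ASCII character as two bytes, the second of which is NUL (shown as a "space" by naive tools), and the file opens with the byte-order mark FF FE (the box character). A minimal sketch reproducing the byte layout (with a made-up string, not the poster's file):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a short ASCII string the way Windows tools often save
# "Unicode" text: UTF-16LE with a leading byte-order mark.
my $bytes = "\xFF\xFE" . encode("UTF-16LE", "zs_PS");

# Every ASCII character becomes <char><NUL>; read naively, the NULs
# show up as the mysterious "spaces" between letters, and FF FE shows
# up as a box or as the three cp437 characters seen above.
printf "%v02X\n", $bytes;   # FF.FE.7A.00.73.00.5F.00.50.00.53.00
```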
does anyone have any suggestion? thanks for your help!

By the way, I also tried to strip the first char, to no avail: the char is removed from each line but the first, i.e.:
my $courseID=substr($result[0],1);

Finally, if anyone can shed light on this: when I went back to edit this post, I noticed that what I pasted in from the cmd window shows up as a character code, more precisely as &#9632; in the first example and as ´&#9559;&#9488; in the second.

lorenzo

Replies are listed 'Best First'.
Re: CSV nightmare
by ikegami (Patriarch) on Jun 03, 2008 at 01:03 UTC
    open(my $data, '<:encoding(UCS-2le)', $file)

    Actually, due to problems with the placement of the :crlf layer,

    open(my $data, '<:raw:encoding(UCS-2le):crlf:utf8', $file)

    And the funny char is the BOM (U+FEFF).

    read($data, my $bom='', 1); # Discard BOM.

    (To the best of my knowledge, what Microsoft uses is not really UTF-16 but UCS-2le.)

      (To the best of my knowledge, what Microsoft uses is not really UTF-16 but UCS-2le.)

      I used to think that UCS-2LE was a synonym for UTF-16LE (likewise with BE instead of LE). But I then found this bit in a FAQ at unicode.org:

      Q: What is the difference between UCS-2 and UTF-16?

      A: UCS-2 is what a Unicode implementation was up to Unicode 1.1, before surrogate code points and UTF-16 were added as concepts to Version 2.0 of the standard. This term should now be avoided.

      When interpreting what people have meant by "UCS-2" in past usage, it is best thought of as not a data format, but as an indication that an implementation does not interpret any supplementary characters. In particular, for the purposes of data exchange, UCS-2 and UTF-16 are identical formats. Both are 16-bit, and have exactly the same code unit representation.

      The effective difference between UCS-2 and UTF-16 lies at a different level, when one is interpreting a sequence of code units as code points or as characters. In that case, a UCS-2 implementation would not handle processing like character properties, codepoint boundaries, collation, etc. for supplementary characters.

      The current version of Unicode is 5.something, so there's not much point in using the "UCS-2" terminology these days.

      And yes, Microsoft tends to favor the LE byte order, especially for text data (MS-Word, and "plain-text" exports from MS-Office products). But cells in Excel spreadsheets are, for some reason, stored as BE.

      As far as perl encoding layers are concerned, UTF-16 (with no byte-order spec) tends to mean: for output, byte order is apparently BE by default; for input, byte order is determined by a stream-initial BOM (if the BOM isn't there, perl complains about it; if it is there, perl will remove it for you).

      (updated last paragraph to reflect ikegami's corrections -- thanks, ike)

        To me, there are two important differences between UCS-2 and UTF-16.

        The first important difference is that UCS-2 can only represent U+0000 to U+FFFF, whereas UTF-16 can represent any UNICODE character.

        The second important difference is the number of bytes UCS-2 and UTF-16 use to store a character. Each UCS-2 character is exactly 16 bits in size, whereas UTF-16 is like UTF-8. Some characters require more than one word.
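The size difference is easy to observe in Perl: a BMP character occupies one 16-bit code unit in UTF-16, while a supplementary character becomes a surrogate pair of two code units. A small sketch:

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00E9 (é) is in the Basic Multilingual Plane:
# one 16-bit code unit, i.e. 2 bytes.
my $bmp = encode("UTF-16LE", "\x{00E9}");

# U+1D11E (musical G clef) is a supplementary character:
# a surrogate pair, i.e. two code units, 4 bytes.
my $supp = encode("UTF-16LE", "\x{1D11E}");

printf "BMP char: %d bytes, supplementary char: %d bytes\n",
    length($bmp), length($supp);   # 2 vs 4
```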

        for output, byte order is determined by the cpu

        No. I'm on an x86 (LE machine), but UTF-16be was used.

        for input, byte order is determined by a stream-initial BOM (if the BOM isn't there, perl complains about it; if it is there, perl does not remove it for you).

        No. Perl *does* remove it for you, just like it adds it for you for output.
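A quick round trip through a temporary file confirms both halves of this: writing through the :encoding(UTF-16) layer prepends a BOM, and reading back through the same layer consumes it. A sketch (Unix-style, without the :crlf complication discussed above):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmpfh, $tmp) = tempfile();
close $tmpfh;

# Write through the UTF-16 layer: perl adds the BOM for us.
open my $out, '>:raw:encoding(UTF-16)', $tmp or die "$tmp: $!";
print {$out} "abc\n";
close $out;

# Peek at the raw bytes: the file starts with a BOM.
open my $raw, '<:raw', $tmp or die "$tmp: $!";
read $raw, my $head, 2;
close $raw;
printf "leading bytes: %v02X\n", $head;

# Read back through the layer: the BOM is consumed, not returned.
open my $in, '<:raw:encoding(UTF-16)', $tmp or die "$tmp: $!";
my $line = <$in>;
close $in;
print $line eq "abc\n" ? "BOM stripped\n" : "BOM leaked\n";
```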

Re: CSV nightmare
by snoopy (Curate) on Jun 03, 2008 at 01:36 UTC
    Also, there's another problem with your second solution. Unicode support remains on Text::CSV_XS's todo list.

    You could instead try the largely compatible Text::CSV::Unicode.

#!/usr/bin/perl
use warnings;
use strict;

# solution 2, converted to Text::CSV::Unicode
# replacement for Text::CSV_XS
use Text::CSV::Unicode;

my $csv = Text::CSV::Unicode->new();
my $count = 0;
my $file = 'PS0002_9_2006b.txt';
if (defined $ARGV[0]) {
    $file = $ARGV[0];
}
my $sum = 0;

# ikegami's fix
open(my $data, '<:raw:encoding(UCS-2le):crlf:utf8', $file)
    or die "Could not open '$file'\n";

while (my $line = <$data>) {
    chomp $line;
    exit if ($count >= 20);
    $count++;
    if ($csv->parse($line)) {
        my @columns = $csv->fields();
        $sum += $columns[2];
    }
    else {
        warn "Line could not be parsed: $line\n";
    }
}
print "$sum\n";
# end solution 2_______________________

      Text::CSV might soon be extended with a layer that deals with encodings, somewhat like this:

      use Text::CSV::Encoded;

      my $csv = Text::CSV::Encoded->new ({
          encoding     => "utf-8",     # Both in and out
          encoding_in  => "utf-16le",  # Only the input
          encoding_out => "cp1252",    # Only the output
          });

      Until then, I think

      binmode STDOUT, ":utf8";
      my $csv = Text::CSV_XS->new ({ binary => 1 });
      open my $fh, "<:encoding(utf-16le)", $file or die "$file: $!";
      while (my $row = $csv->getline ($fh)) {
          print $row->[4];
          }

      should work


      Enjoy, Have FUN! H.Merijn

        It's already been covered that it should be

        open my $fh, "<:raw:encoding(utf-16):crlf:utf8", $file or die "$file: $!";

        or more precisely,

        open my $fh, "<:raw:encoding(ucs-2le):crlf:utf8", $file
            or die "$file: $!";
        read($fh, my $bom='', 1);

        And no, it doesn't work. Not if the data contains any non-ASCII characters, at least, but that's the whole point of this exercise. The UTF8 flag gets turned off, so the UTF-8 encoding of the characters is treated as iso-latin-1.

        For example, if a field contains <"é">, Text::CSV_XS returns the two characters <Ã©> instead of <é>. (I'm using angled brackets to quote to avoid confusion with the double-quotes in the CSV file.)

        For example, if a field contains <"♠">, Text::CSV_XS returns the three characters <â™ > instead of <♠>.
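        That mangling can be reproduced outside Text::CSV_XS: take a decoded character, encode it to UTF-8 bytes, then (wrongly) interpret those bytes as latin-1 characters, which is exactly what happens when the UTF8 flag is dropped. A sketch:

        ```perl
        use strict;
        use warnings;
        use Encode qw(encode decode);

        binmode STDOUT, ':encoding(UTF-8)';

        # The UTF-8 encoding of é (U+00E9) is the two bytes C3 A9.
        my $bytes = encode('UTF-8', "\x{00E9}");

        # Misinterpreting those bytes as latin-1 yields two
        # characters, Ã (U+00C3) and © (U+00A9), instead of one.
        my $wrong = decode('ISO-8859-1', $bytes);
        print "$wrong\n";            # prints "Ã©"
        print length($wrong), "\n";  # 2 characters, not 1
        ```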

        The flag needs to be reinstated, so it should be:

        use Encode qw( _utf8_on );

        my $csv = Text::CSV_XS->new ({ binary => 1 });

        # UTF-16 or UCS-2 file with BOM and CRLF or LF line endings.
        open my $fh, "<:raw:encoding(utf-16):crlf:utf8", $file
            or die "$file: $!";

        while (my $row = $csv->getline ($fh)) {
            # Fix inability of CSV_XS to handle UTF8 strings.
            _utf8_on($_) for @$row;
            print $row->[4];
        }

        There is at least one other problem with treating characters encoded using UTF-8 no differently than characters encoded using iso-latin-1, as Text::CSV_XS does.

        If any of eol, sep_char, etc is passed a string with the UTF8 flag off and it contains a character in [\x80-\xFF], Text::CSV_XS can generate false positives. However, this is unlikely to ever happen.

        Text::CSV might soon be extended with a layer that deals with encodings

        I don't see the point, since Text::CSV doesn't open any file handles. All it needs to do is respect the UTF8 flag on strings it receives via getline, eol, sep_char, etc. Currently (well, 0.34 and presumably 0.45), it ignores it.

Re: CSV nightmare
by dragonchild (Archbishop) on Jun 03, 2008 at 13:14 UTC
    Text::xSV (written by our own tilly) should solve all your problems.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: CSV nightmare
by Tux (Canon) on Jun 04, 2008 at 19:03 UTC

    Could you give it a go with Text::CSV_XS 0.50, which I just uploaded? It now promotes the UTF-8 setting you got from the file-handle encoding to the fields in getline () and parse ().


    Enjoy, Have FUN! H.Merijn