in reply to Text::CSV on Unicode file

There is not enough information to answer this question (yet).

First of all, all your CSV lines end with a comma. Though this is valid CSV, it is not valid for a header line, so I would expect it to generate this error:

# CSV_XS ERROR: 1012 - INI - the header contains an empty field @ rec +1 pos 0

Even if there were no trailing commas, this error should still occur, as there is an empty field between "Path.name" and "Thumbnail.checksum". The documentation is quite clear about that:

If the header is empty, contains more than one unique separator out of the allowed set, contains empty fields, or contains identical fields (after folding), it will croak with error 1010, 1011, 1012, or 1013 respectively.
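
To make those conditions concrete, here is a rough plain-Perl sketch of three of the four checks, applied to a list of header fields that has already been split. This is not how Text::CSV_XS implements them: `check_header` is a hypothetical helper, lower-casing is assumed as the folding, and the separator check (1011) is left out because it needs the raw line.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::CSV_XS: returns the error code the
# documentation associates with a bad header, or 0 if nothing is wrong.
sub check_header {
    my @fields = @_;

    # 1010: the header is empty
    return 1010 unless @fields;

    # 1012: the header contains an empty field
    return 1012 if grep { !defined ($_) || $_ eq "" } @fields;

    # 1013: identical fields after folding (assumed here: lower case)
    my %seen;
    for my $name (map { lc $_ } @fields) {
        return 1013 if $seen{$name}++;
    }

    return 0;
}

print check_header ("Path.name", "", "Thumbnail.checksum"), "\n";  # prints 1012
print check_header ("Path.name", "path.NAME"), "\n";               # prints 1013
print check_header ("Path.name", "Thumbnail.checksum"), "\n";      # prints 0
```

The first call mirrors the situation in the posted header: an empty name between "Path.name" and "Thumbnail.checksum".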

Secondly, you'll get your errors more reliably if you pass the auto_diag option:

my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 });

Thirdly, your data might contain UTF-8 encoded information, but your example does not, so we'd need to know what type of data is where in the header line.

We also want to know what versions of Text::CSV and - if installed - Text::CSV_XS you are using in order to try to reproduce and/or explain.

My guess is that if you add auto_diag and make sure there are no empty fields in the header line, your code might work.


Enjoy, Have FUN! H.Merijn

Re^2: Text::CSV on Unicode file
by dd-b (Pilgrim) on Jun 07, 2017 at 20:42 UTC

    The CSV is what's produced by Thumbs Plus exporting the database. I can potentially massage it of course -- but the reason I'm reading it with Text::CSV was to *avoid* writing my own code to parse it. (it's a one-shot hack, it only has to work with *my particular* data, not in general, though, so hacking the input file is entirely on the table.) Taking off final commas is hard (something like 270,000 lines) but I can add content to the header to make it a "field" of no meaning. (And I'm only doing this because of repeated failures to get ODBC access to the database to work in anything except Libre Office -- which won't export it, and takes too long, three days and counting, to put it in a dialog box widget I might be able to cut and paste out of).

    The error happens immediately, when I've only called the function to read the header line, so later lines shouldn't matter that I can see; and the time it takes doesn't suggest it's reading the whole file (270,000 lines of the size shown in my sample). Should I be making calls to define fields myself instead of reading the header, maybe? I'm just trying to do what seems the simple, direct way to use this code, if I'm guessing that wrong I'm open to change.

    Adding auto_diag did not add anything at all to the output.

    Versions are current:

    -------------------------------------------------------------------------
    (no description)
    I/IS/ISHIGAKI/Text-CSV-1.95.tar.gz
    /usr/lib/perl5/site_perl/5.22/Text/CSV.pm
    Installed: 1.95
    CPAN: 1.95 up to date
    Kenichi Ishigaki (ISHIGAKI)
    ishigaki@cpan.org

    ddb@DDB4 /cygdrive/p/work/tpdbfix/app
    $ cpan -D Text::CSV_XS
    Loading internal null logger. Install Log::Log4perl for logging messages
    Reading '/home/ddb/.cpan/Metadata'
    Database was generated on Wed, 07 Jun 2017 18:17:02 GMT
    Text::CSV_XS
    -------------------------------------------------------------------------
    (no description)
    H/HM/HMBRAND/Text-CSV_XS-1.29.tgz
    /usr/lib/perl5/site_perl/5.22/i686-cygwin-threads-64int/Text/CSV_XS.pm
    Installed: 1.29
    CPAN: 1.29 up to date
    H.Merijn Brand (HMBRAND)
    h.m.brand@xs4all.nl
Re^2: Text::CSV on Unicode file
by dd-b (Pilgrim) on Jun 07, 2017 at 23:29 UTC
    And adding "nothing.nothing" to the end of the header line (so there isn't that trailing comma, but without changing the number of commas) made zero difference to the output, still get the exact same error.

      You can auto-generate headers for empty fields:

      my @hdr = $csv->header ($fh, { munge_column_names => sub { state $i; $_ || "nothing.".$i++ }});

      That would result in these headers:

      Volume.label Volume.serialno Volume.vtype Volume.netname Volume.filesystem Path.name nothing.0 Thumbnail.checksum Thumbnail.width Thumbnail.height Thumbnail.horiz_res Thumbnail.vert_res Thumbnail.colortype Thumbnail.colordepth Thumbnail.gamma Thumbnail.thumbnail_width Thumbnail.thumbnail_height Thumbnail.thumbnail_type Thumbnail.thumbnail_size Thumbnail.name Thumbnail.metric1 Keywords.pkeywords nothing.1
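
      For illustration only, the naming logic of that callback can be tried on its own in plain Perl, without Text::CSV involved. `name_empty_fields` is a hypothetical helper and the short field list is made up from names in the header above. Unlike the `$_ || ...` test in the callback, this sketch checks the length, so a column literally named "0" would be kept.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the callback: replace each empty header name
# with "nothing.N", counting from 0, and pass all other names through.
sub name_empty_fields {
    my $i = 0;
    return map { length ($_ // "") ? $_ : "nothing." . $i++ } @_;
}

my @hdr = name_empty_fields ("Path.name", "", "Thumbnail.checksum", "Keywords.pkeywords", "");
print join (" ", @hdr), "\n";
# prints: Path.name nothing.0 Thumbnail.checksum Keywords.pkeywords nothing.1
```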

      Enjoy, Have FUN! H.Merijn
Re^2: Text::CSV on Unicode file
by dd-b (Pilgrim) on Jun 07, 2017 at 23:46 UTC

    Okay, something is weirdly amiss. The error still occurs with only one data line (which looks well-behaved to me) and the header line modified to give names to fields previously empty (so I didn't change the number of commas, but put something between the ones that had nothing there).

    "Volume.label","Volume.serialno","Volume.vtype","Volume.netname","Volume.filesystem","Path.name","other.nothing","Thumbnail.checksum","Thumbnail.width","Thumbnail.height","Thumbnail.horiz_res","Thumbnail.vert_res","Thumbnail.colortype","Thumbnail.colordepth","Thumbnail.gamma","Thumbnail.thumbnail_width","Thumbnail.thumbnail_height","Thumbnail.thumbnail_type","Thumbnail.thumbnail_size","Thumbnail.name","Thumbnail.metric1","Keywords.pkeywords","nothing.nothing"
    PCD0138,,4037894171,5,\\ddb\r$,CDFS,PHOTO_CD\IMAGES,1,0,0,"1996-09-30T21:38:57","2002-10-12T00:29:25",3368960,2147483648,512,768,0,0,0,24,0,68,100,518,336,IMG0002.PCD,m000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,b00000000000000000000000000000000,,0,";",lose

    My code remains the same as the original post, and the error message remains precisely the same. Here's the output, along with a bit more info about the file:

    ddb@DDB4 /cygdrive/p/work/tpdbfix/app
    $ ./readtpexport.pl play-thumbs.txt
    play-thumbs.txt
    Point a
    Strings with code points over 0xFF may not be mapped into in-memory file handles
    readline() on closed filehandle $h at /usr/lib/perl5/site_perl/5.22/i686-cygwin-threads-64int/Text/CSV_XS.pm line 830.
     at ./readtpexport.pl line 25.

    ddb@DDB4 /cygdrive/p/work/tpdbfix/app
    $ file play-thumbs.txt
    play-thumbs.txt: UTF-8 Unicode (with BOM) text, with very long lines

    ddb@DDB4 /cygdrive/p/work/tpdbfix/app
    $ ls -l play-thumbs.txt
    -rwxrwxr-x 1 ddb Unix_Group+1001 874 Jun 7 18:41 play-thumbs.txt
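
    The first message in that output is a core Perl diagnostic, not a Text::CSV one. Below is a minimal sketch of one way it can arise. It assumes, without anything in this thread confirming it, that a header line which has already been decoded (so the file's BOM has become the character U+FEFF) is opened as an in-memory file. The sample string is made up from the first two header names.

```perl
use strict;
use warnings;

# Assumption for illustration: a header line that was already decoded, so the
# BOM at the start of the file has become the single character U+FEFF.
my $decoded = "\x{FEFF}\"Volume.label\",\"Volume.serialno\"\n";

# Opening an in-memory file on a string that holds a character above 0xFF
# fails (and warns) on the Perl 5.22 used in this thread and on later versions.
my $ok_decoded = open my $h1, "<", \$decoded;
print "decoded string: ", ($ok_decoded ? "opened" : "open failed"), "\n";

# The same text as UTF-8 encoded bytes contains nothing above 0xFF.
my $bytes = $decoded;
utf8::encode ($bytes);
my $ok_bytes = open my $h2, "<", \$bytes;
print "encoded bytes:  ", ($ok_bytes ? "opened" : "open failed"), "\n";
```

    If the first open fails, any later read from that handle would give a "readline() on closed filehandle" message like the one shown. Whether this is what happens at line 830 of CSV_XS.pm depends on how the file was opened in the original script, which is not shown here.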

      Can you make those two lines of CSV available in a .tgz or a .zip somewhere? I cannot reproduce this on my Linuxes, whatever I try.

      Feel free to make this an issue on Text::CSV_XS' issues. Even if it is cygwin related, this should not happen.


      Enjoy, Have FUN! H.Merijn