PerlMonks  

Text::CSV on Unicode file

by dd-b (Monk)
on Jun 07, 2017 at 06:32 UTC ( [id://1192242] : perlquestion )

dd-b has asked for the wisdom of the Perl Monks concerning the following question:

When I try to read the header line of a CSV file that I opened with Unicode encoding (and which actually has some non-ASCII in it, though not, I think, in the header line), I get the error:

Strings with code points over 0xFF may not be mapped into in-memory file handles
readline() on closed filehandle $h at /usr/lib/perl5/site_perl/5.22/i686-cygwin-threads-64int/Text/CSV_XS.pm line 830.

*I'm* not doing file IO on any strings, and the code line given is in Text::CSV not my code.
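
For context, the first message is a warning from Perl itself rather than from Text::CSV: it appears when a scalar containing characters above 0xFF is opened as an in-memory file. The open then fails, which would account for the follow-up "readline() on closed filehandle". A minimal sketch using only core Perl follows; the sample string is made up, and whether the module does exactly this internally is an assumption here:

```perl
use strict;
use warnings;

# Hypothetical header line as it would look after decoding: U+FEFF (a decoded
# BOM) followed by ordinary ASCII text.
my $line = "\x{FEFF}\"Volume.label\",\"Volume.serialno\"\n";

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# In-memory files hold bytes only, so this open fails for a string that
# contains a code point above 0xFF.
my $opened = open my $h, "<", \$line;

print $opened ? "open succeeded\n" : "open failed: $!\n";
print "warning: $_" for @warnings;
```

So no explicit string IO in the calling script is needed for this message to show up; any code that opens a reference to a decoded line can trigger it.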

There's a "Unicode" section to the doc for Text::CSV and I think I did what it said. I verified that turning *off* unicode for that file eliminates this error message. (Since there are actual non-ASCII characters in the file that must be read and comprehended later that's not a long-term solution.)

Any ideas? The symptoms look like Unicode just doesn't work, but the Unicode section in the docs seems pretty clearly to be based on the assumption that it does, and it must be pretty commonly used.

Not much to my code so far, just the start of this bit. It's the $csv->header($ifh) call that throws this error.

#! /usr/bin/env perl

# Read the export from Thumbs Plus including keywords from filename given.

use warnings;
use strict;
use utf8;                         # so literals and identifiers can be in UTF-8
use v5.12;                        # or later to get "unicode_strings" feature
use warnings qw(FATAL utf8);      # fatalize encoding glitches
#use open qw(:std :utf8);         # undeclared streams in UTF-8
#use charnames qw(:full :short);  # unneeded in v5.16

use Text::CSV;
use Data::Dumper;                 # debug

my $csv = Text::CSV->new ( { binary => 1 } )
    or die "Cannot use CSV in: ".Text::CSV->error_diag();

print $ARGV[0],"\n";
open my $ifh, "<:encoding(UTF-8)", $ARGV[0]
    or die "$ARGV[0]: $!";

print "Point a\n";
# Returns "the instance" -- of what? Do I care?
my $thingie = $csv->header ($ifh);
print "Point b\n";

print Dumper($csv), "\n";

The first three lines (long lines) of the input file are:

$ head -3 /cygdrive/p/Photos/ThumbsPlus/Thumbs.txt
"Volume.label","Volume.serialno","Volume.vtype","Volume.netname","Volume.filesystem","Path.name",,"Thumbnail.checksum","Thumbnail.width","Thumbnail.height","Thumbnail.horiz_res","Thumbnail.vert_res","Thumbnail.colortype","Thumbnail.colordepth","Thumbnail.gamma","Thumbnail.thumbnail_width","Thumbnail.thumbnail_height","Thumbnail.thumbnail_type","Thumbnail.thumbnail_size","Thumbnail.name","Thumbnail.metric1","Keywords.pkeywords",
PCD0138,,4037894171,5,\\ddb\r$,CDFS,PHOTO_CD\IMAGES,1,0,0,"1996-09-30T21:38:57","2002-10-12T00:29:25",3368960,2147483648,512,768,0,0,0,24,0,68,100,518,336,IMG0002.PCD,m000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,b00000000000000000000000000000000,,0,";",
PCD0138,,4037894171,5,\\ddb\r$,CDFS,PHOTO_CD\IMAGES,1,0,0,"1996-09-30T21:38:57","2002-10-12T00:29:25",3354624,2147483648,512,768,0,0,0,24,0,68,100,518,336,IMG0003.PCD,m000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,b00000000000000000000000000000000,,0,";",

Replies are listed 'Best First'.
Re: Text::CSV on Unicode file
by Tux (Canon) on Jun 07, 2017 at 07:10 UTC

    There is not enough information to answer this question (yet).

    First of all, all your CSV lines end with a comma. Though this is valid CSV, it is not valid for a header line, so I would expect that to generate this error:

    # CSV_XS ERROR: 1012 - INI - the header contains an empty field @ rec 1 pos 0

    Even if there were no trailing commas, this error should happen, as there is an empty field between "Path.name" and "Thumbnail.checksum". The documentation is quite clear about that:

    If the header is empty, contains more than one unique separator out of the allowed set, contains empty fields, or contains identical fields (after folding), it will croak with error 1010, 1011, 1012, or 1013 respectively.
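
To make those rules concrete, here is a rough sketch of three of the four checks (1011, the separator check, is left out). The helper check_header is hypothetical and is not how Text::CSV implements them; it parses naively, assuming no commas or quotes inside the field names, and assumes that "folding" means lower-casing:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::CSV: returns the error number the
# quoted documentation associates with a bad header, or 0 if it looks fine.
sub check_header {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                        # drop the line ending
    return 1010 if $line eq "";                  # header is empty

    # Split on commas (keeping trailing empty fields), strip the surrounding
    # quotes, and fold to lower case.
    my @fields = map { my $f = lc; $f =~ s/^"|"$//g; $f } split /,/, $line, -1;

    return 1012 if grep { $_ eq "" } @fields;    # at least one empty field
    my %seen;
    return 1013 if grep { $seen{$_}++ } @fields; # identical after folding
    return 0;
}

# Shortened, made-up variants of the header from the question.
print check_header (qq{"Path.name",,"Thumbnail.checksum",\n}), "\n";
print check_header (qq{"Path.name","path.NAME"\n}), "\n";
print check_header (qq{"Path.name","Thumbnail.checksum"\n}), "\n";
```

Under these assumptions, both the doubled comma and the trailing comma count as empty fields, which is why the posted header would be expected to fail with 1012.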

    Secondly, you'll get your errors more reliably if you pass the auto_diag option:

    my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1 });

    Thirdly, your data might contain UTF-8 encoded information, but your example does not, so we'd need to know what type of data is where in the header line.

    We also want to know what versions of Text::CSV and - if installed - Text::CSV_XS you are using in order to try to reproduce and/or explain.

    My guess is that if you add auto_diag and make sure there are no empty fields in the header line, your code might work.


    Enjoy, Have FUN! H.Merijn

      The CSV is what's produced by Thumbs Plus exporting the database. I can potentially massage it of course -- but the reason I'm reading it with Text::CSV was to *avoid* writing my own code to parse it. (it's a one-shot hack, it only has to work with *my particular* data, not in general, though, so hacking the input file is entirely on the table.) Taking off final commas is hard (something like 270,000 lines) but I can add content to the header to make it a "field" of no meaning. (And I'm only doing this because of repeated failures to get ODBC access to the database to work in anything except Libre Office -- which won't export it, and takes too long, three days and counting, to put it in a dialog box widget I might be able to cut and paste out of).

      The error happens immediately, when I've only called the function to read the header line, so later lines shouldn't matter that I can see; and the time it takes doesn't suggest it's reading the whole file (270,000 lines of the size shown in my sample). Should I be making calls to define fields myself instead of reading the header, maybe? I'm just trying to do what seems the simple, direct way to use this code, if I'm guessing that wrong I'm open to change.

      Adding auto_diag did not add anything at all to the output.

      Versions are current:

      -------------------------------------------------------------------------
              (no description)
              I/IS/ISHIGAKI/Text-CSV-1.95.tar.gz
              /usr/lib/perl5/site_perl/5.22/Text/CSV.pm
              Installed: 1.95
              CPAN:      1.95  up to date
              Kenichi Ishigaki (ISHIGAKI)
              ishigaki@cpan.org

      ddb@DDB4 /cygdrive/p/work/tpdbfix/app
      $ cpan -D Text::CSV_XS
      Loading internal null logger. Install Log::Log4perl for logging messages
      Reading '/home/ddb/.cpan/Metadata'
        Database was generated on Wed, 07 Jun 2017 18:17:02 GMT
      Text::CSV_XS
      -------------------------------------------------------------------------
              (no description)
              H/HM/HMBRAND/Text-CSV_XS-1.29.tgz
              /usr/lib/perl5/site_perl/5.22/i686-cygwin-threads-64int/Text/CSV_XS.pm
              Installed: 1.29
              CPAN:      1.29  up to date
              H.Merijn Brand (HMBRAND)
              h.m.brand@xs4all.nl

      And adding "nothing.nothing" to the end of the header line (so there isn't that trailing comma, but without changing the number of commas) made zero difference to the output, still get the exact same error.

        You can auto-generate headers for empty fields:

        my @hdr = $csv->header ($fh, { munge_column_names => sub { state $i; $_ || "nothing.".$i++ }});

        That would result in these headers:

        Volume.label Volume.serialno Volume.vtype Volume.netname Volume.filesystem Path.name nothing.0 Thumbnail.checksum Thumbnail.width Thumbnail.height Thumbnail.horiz_res Thumbnail.vert_res Thumbnail.colortype Thumbnail.colordepth Thumbnail.gamma Thumbnail.thumbnail_width Thumbnail.thumbnail_height Thumbnail.thumbnail_type Thumbnail.thumbnail_size Thumbnail.name Thumbnail.metric1 Keywords.pkeywords nothing.1
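
The callback in that call can be tried on its own, without the module. The sketch below runs the same closure over a made-up, shortened list of raw header fields; that the module calls the callback with $_ set to each name is an assumption taken from the snippet above:

```perl
use v5.12;      # enables "state" and "say" (and strict)
use warnings;

# Same callback shape as in the reply: keep a non-empty name, otherwise
# invent "nothing.N" using a counter that persists across calls.
my $munge = sub { state $i; $_ || "nothing." . $i++ };

# Made-up sample of raw header fields, two of them empty.
my @raw = ("Path.name", "", "Thumbnail.checksum", "");

# Assumption: the callback sees each column name in $_.
my @names = map { $munge->($_) } @raw;

say for @names;
```

The counter starts at 0 because a post-increment of an undefined value returns 0. Note that this simple test also replaces a column literally named "0", since that string is false in Perl.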

        Enjoy, Have FUN! H.Merijn

      Okay, something is weirdly amiss. The error still occurs with only one data line (which looks well-behaved to me) and the header line modified to give names to fields previously empty (so I didn't change the number of commas, but put something between the ones that had nothing there).

      "Volume.label","Volume.serialno","Volume.vtype","Volume.netname","Volume.filesystem","Path.name","other.nothing","Thumbnail.checksum","Thumbnail.width","Thumbnail.height","Thumbnail.horiz_res","Thumbnail.vert_res","Thumbnail.colortype","Thumbnail.colordepth","Thumbnail.gamma","Thumbnail.thumbnail_width","Thumbnail.thumbnail_height","Thumbnail.thumbnail_type","Thumbnail.thumbnail_size","Thumbnail.name","Thumbnail.metric1","Keywords.pkeywords","nothing.nothing"
      PCD0138,,4037894171,5,\\ddb\r$,CDFS,PHOTO_CD\IMAGES,1,0,0,"1996-09-30T21:38:57","2002-10-12T00:29:25",3368960,2147483648,512,768,0,0,0,24,0,68,100,518,336,IMG0002.PCD,m000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000,b00000000000000000000000000000000,,0,";",lose

      My code remains the same as the original post, and the error message remains precisely the same. Here's the output, along with a bit more info about the file:

      ddb@DDB4 /cygdrive/p/work/tpdbfix/app
      $ ./readtpexport.pl play-thumbs.txt
      play-thumbs.txt
      Point a
      Strings with code points over 0xFF may not be mapped into in-memory file handles
      readline() on closed filehandle $h at /usr/lib/perl5/site_perl/5.22/i686-cygwin-threads-64int/Text/CSV_XS.pm line 830.
       at ./readtpexport.pl line 25.

      ddb@DDB4 /cygdrive/p/work/tpdbfix/app
      $ file play-thumbs.txt
      play-thumbs.txt: UTF-8 Unicode (with BOM) text, with very long lines

      ddb@DDB4 /cygdrive/p/work/tpdbfix/app
      $ ls -l play-thumbs.txt
      -rwxrwxr-x 1 ddb Unix_Group+1001 874 Jun  7 18:41 play-thumbs.txt

        Can you make those two lines of CSV available in a .tgz or a .zip somewhere? I cannot reproduce this on my Linux systems, whatever I try.

        Feel free to file this as an issue in the Text::CSV_XS issue tracker. Even if it is Cygwin-related, this should not happen.


        Enjoy, Have FUN! H.Merijn
Re: Text::CSV on Unicode file
by dd-b (Monk) on Jun 08, 2017 at 04:17 UTC
    Just for drill, I copied the script and the small test file to a different system to see if the problem reproduces.

    It does. The original environment is Cygwin, which I think anybody familiar with it is always just a *little* suspicious of, but after copying the two test files to a FreeBSD box, the problem reproduces exactly.

    [ddb@playpen ~/smbshare/Documents/work/tpdbfix/app]$ cpan -D Text::CSV_XS
    Loading internal null logger. Install Log::Log4perl for logging messages
    Reading '/home/ddb/.cpan/Metadata'
      Database was generated on Wed, 07 Jun 2017 21:41:02 GMT
    Text::CSV_XS
    -------------------------------------------------------------------------
            (no description)
            H/HM/HMBRAND/Text-CSV_XS-1.29.tgz
            /usr/local/lib/perl5/site_perl/mach/5.24/Text/CSV_XS.pm
            Installed: 1.29
            CPAN:      1.29  up to date
            H.Merijn Brand (HMBRAND)
            h.m.brand@xs4all.nl

    [ddb@playpen ~/smbshare/Documents/work/tpdbfix/app]$ cpan -D Text::CSV
    Loading internal null logger. Install Log::Log4perl for logging messages
    Reading '/home/ddb/.cpan/Metadata'
      Database was generated on Wed, 07 Jun 2017 21:41:02 GMT
    Text::CSV
    -------------------------------------------------------------------------
            (no description)
            I/IS/ISHIGAKI/Text-CSV-1.95.tar.gz
            /usr/local/lib/perl5/site_perl/Text/CSV.pm
            Installed: 1.95
            CPAN:      1.95  up to date
            Kenichi Ishigaki (ISHIGAKI)
            ishigaki@cpan.org

    [ddb@playpen ~/smbshare/Documents/work/tpdbfix/app]$ ls -l play-thumbs.txt readtpexport.pl
    -rwxrwxr-x 1 ddb ddb 874 Jun  7 18:41 play-thumbs.txt
    -rwxrwxr-x 1 ddb ddb 876 Jun  7 21:05 readtpexport.pl
    [ddb@playpen ~/smbshare/Documents/work/tpdbfix/app]$ file play-thumbs.txt
    play-thumbs.txt: UTF-8 Unicode (with BOM) text, with very long lines
    [ddb@playpen ~/smbshare/Documents/work/tpdbfix/app]$ ./readtpexport.pl play-thumbs.txt
    play-thumbs.txt
    Point a
    Strings with code points over 0xFF may not be mapped into in-memory file handles
    readline() on closed filehandle $h at /usr/local/lib/perl5/site_perl/mach/5.24/Text/CSV_XS.pm line 830.
     at ./readtpexport.pl line 25.
    [ddb@playpen ~/smbshare/Documents/work/tpdbfix/app]$

      [ddb@playpen ~/smbshare/Documents/work/tpdbfix/app]$ file play-thumbs.txt
      play-thumbs.txt: UTF-8 Unicode (with BOM) text, with very long lines

      That (with BOM) was the trigger! I can now reproduce it.

      I have fixed this for version 1.31. I will need to add some tests for that before I release.

      With the current versions, just don't use :encoding(utf-8) on open: with that layer in place, the header method no longer sees the BOM as bytes and so doesn't recognize it.
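
The difference between a BOM "as bytes" and a decoded BOM can be seen with core Perl alone. The sketch below writes a small made-up CSV file that starts with a UTF-8 BOM, then reads its first line once raw and once through :encoding(UTF-8). The file contents, the helper first_line, and the byte pattern are illustrative assumptions, not the module's actual code:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a tiny made-up CSV file that starts with a UTF-8 BOM (bytes EF BB BF).
my ($out, $path) = tempfile (UNLINK => 1);
binmode $out;
print {$out} "\xEF\xBB\xBF", qq{"Volume.label","Path.name"\n};
close $out or die "close: $!";

# Hypothetical helper: read the first line of the file using the given mode.
sub first_line {
    my ($mode) = @_;
    open my $fh, $mode, $path or die "$path: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

my $raw     = first_line ("<:raw");              # undecoded bytes
my $decoded = first_line ("<:encoding(UTF-8)");  # decoded characters

# A byte-oriented BOM test matches only the raw line; after decoding, the
# three bytes have become the single character U+FEFF.
my $bom_bytes = qr/^\xEF\xBB\xBF/;
printf "raw line matches byte BOM:       %s\n", $raw     =~ $bom_bytes  ? "yes" : "no";
printf "decoded line matches byte BOM:   %s\n", $decoded =~ $bom_bytes  ? "yes" : "no";
printf "decoded line starts with U+FEFF: %s\n", $decoded =~ /^\x{FEFF}/ ? "yes" : "no";
```

U+FEFF is also a code point above 0xFF, which fits the warning in the original error output. Opening the file with a plain "<", as suggested above, leaves the BOM as bytes for the header method to find.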


      Enjoy, Have FUN! H.Merijn
        Outstanding! Glad I finally was able to describe it precisely enough that you could find it. I assume this means the request for a zipped copy of stuff is now irrelevant? If still useful I could certainly do it. My own further investigation discovered bugs I still have to report on Thumbs Plus -- the database export CSV header and body lines aren't compatible. Sigh. But that's not in any way *your* problem :-) . But I've got a new Text::CSV one, too, that I'll post shortly in a new thread.