Altering Text::CSV to handle Unicode data

princepawn has asked for the wisdom of the Perl Monks concerning the following question:

The manual page for Text::CSV states:

This module is based upon a working definition of CSV format which may not be the most general. Allowable characters within a CSV field include 0x09 (tab) and the inclusive range of 0x20 (space) through 0x7E (tilde).

Now let's look at how this part of the spec is implemented:

# ~LINE 308

 } elsif ($$line_ref =~ /^[\t\040-\176]/) {

      # a tab, space, or printable...
      $$piece_ref .= substr($$line_ref, 0, 1);
      substr($$line_ref, 0, 1) = '';

The first thing I notice is that he says 0x20 (space) through 0x7E (tilde), but the code has /^[\t\040-\176]/.

I believe this is because hex 20 equals 40 in some other number system. But is that number system octal or decimal, and how would I know?
How can I make this regular expression accept Unicode characters? I have some CSV files with Unicode characters which Text::CSV barfs on.
Thank God there is a Text::CSV, so I could track this down. There is a faster module with the same API, Text::CSV_XS, in which I would have had no hope of finding the problem. Then I would've had to properly parse CSV files on my own. Eek.

Carter's compass: I know I'm on the right track when by deleting something, I'm adding functionality

Replies are listed 'Best First'.
Re: Altering Text::CSV to handle Unicode data by Ovid (Cardinal) on Jul 15, 2003 at 18:25 UTC
The solution is simple. Don't use Text::CSV. It only handles ASCII data and it doesn't handle embedded newlines. You're already using Text::CSV_XS and you know it's faster, so why not use it? If you want it to match non-ASCII (and the newlines!), just set binary mode to true in the attributes in the constructor.

`my $csv = Text::CSV_XS->new({binary => 1});`

Cheers,
Ovid
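
A minimal sketch of the suggestion above, assuming Text::CSV_XS is installed from CPAN and that the input has already been decoded into Perl character strings. The sample line and the helper name `parse_csv_line` are made up for illustration, and `auto_diag` is optional; it is only there to surface parse errors.

```perl
use strict;
use warnings;
use utf8;                       # the sample literal below contains non-ASCII characters
use Text::CSV_XS;

# binary => 1 lets fields contain characters outside 0x20-0x7E,
# including embedded newlines inside quoted fields.
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

# Hypothetical helper: parse one CSV record and return its fields.
sub parse_csv_line {
    my ($line) = @_;
    $csv->parse($line) or die "Cannot parse line: " . $csv->error_diag;
    return $csv->fields;
}

# Sample record (invented): non-ASCII text, a quoted comma, an embedded newline.
my @fields = parse_csv_line(qq{café,"naïve, quoted","two\nlines"});

binmode STDOUT, ':encoding(UTF-8)';
print scalar(@fields), " fields\n";
print "[$_]\n" for @fields;
```

If the data comes from a file, it would typically be opened with an encoding layer such as `:encoding(UTF-8)` so that the parser sees characters rather than raw bytes.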
Re: Altering Text::CSV to handle Unicode data by halley (Prior) on Jul 15, 2003 at 18:44 UTC
"The first thing I notice is that he says 0x20 (space) through 0x7E (tilde). but the code has /^[\t\040-\176]/."

Those are identical ranges, one in hexadecimal, and the other in octal. (He does also allow `\t` tabs as space, though.)

`printf("%d %d %d \n", 040, 32, 0x20);`
`printf("%d %d %d \n", 0176, 126, 0x7E);`

A leading zero (040) means the number is read as octal. A leading zero-ecks (0x20) means the number is read as hexadecimal.

I'm of the opinion that computer-gradeschoolers should learn some very clear "landmark" numbers to help them understand different codebases. (I also plan to teach my daughter to memorize sixteen powers of two in decimal before she leaves the fourth grade. Inchworm, inchworm, measuring the marigolds...)

binary 00001001 = 011 = 9 = 0x9 = nine
binary 00001100 = 014 = 12 = 0xC = a dozen
binary 00111111 = 077 = 63 = 0x3F = six bits full (8x8-1)
binary 01111111 = 0177 = 127 = 0x7F = seven bits full (16x8-1)
binary 11111111 = 0377 = 255 = 0xFF = eight bits full (16x16-1)
binary 1111111111111111 = 0177777 = 65535 = 0xFFFF = sixteen bits full (256x256-1)

-- `[ e d @ h a l l e y . c c ]`
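
The landmark values above can be cross-checked with a few lines of Perl. This is only a sketch: the list of values is copied from the reply, the helper name `landmark_row` is invented here, and `%b`, `%o`, `%d`, and `%X` are the standard `sprintf` conversions for binary, octal, decimal, and hexadecimal.

```perl
use strict;
use warnings;

# Landmark values from the reply above, written as decimal literals.
my @landmarks = (9, 12, 63, 127, 255, 65535);

# Format one value in binary, octal (leading 0), decimal, and hex (leading 0x).
sub landmark_row {
    my ($n) = @_;
    return sprintf "binary %08b = 0%o = %d = 0x%X", $n, $n, $n, $n;
}

print landmark_row($_), "\n" for @landmarks;

# The same number can be written three ways in Perl source:
# 040 is octal, 0x20 is hexadecimal, and 32 is decimal.
print "040 == 0x20 == 32\n"   if 040  == 0x20 && 0x20 == 32;
print "0176 == 0x7E == 126\n" if 0176 == 0x7E && 0x7E == 126;
```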
Re: Altering Text::CSV to handle Unicode data by bobn (Chaplain) on Jul 15, 2003 at 18:05 UTC
In many contexts, including this, a leading \0 means octal. Use \x to force hex.

--Bob Niederman, http://bob-n.com
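
To illustrate the point about escapes, the sketch below compares the octal character class from Text::CSV with a hex spelling of the same class, then tries a widened class. The widened pattern is an assumption for illustration, not a tested patch to Text::CSV: it simply admits every code point from U+0080 upward, and it presumes the data has been decoded into Perl character strings first.

```perl
use strict;
use warnings;

my $octal_class = qr/^[\t\040-\176]/;    # as written in Text::CSV
my $hex_class   = qr/^[\t\x20-\x7E]/;    # same range spelled in hex

# Count code points 0..255 on which the two classes disagree.
my $mismatches = grep {
    my $c = chr $_;
    ($c =~ $octal_class ? 1 : 0) != ($c =~ $hex_class ? 1 : 0)
} 0 .. 255;
print "classes disagree on $mismatches code points\n";

# Hypothetical widened class: also accept anything from U+0080 upward.
my $wide_class = qr/^[\t\x20-\x7E\x{80}-\x{10FFFF}]/;

# Try a letter, a tab, e-acute, a smiley, and a newline.
for my $char ("A", "\t", "\x{E9}", "\x{263A}", "\n") {
    printf "U+%04X ascii-only=%s widened=%s\n",
        ord($char),
        ($char =~ $octal_class ? "yes" : "no"),
        ($char =~ $wide_class  ? "yes" : "no");
}
```

Widening the class this way still rejects control characters such as an embedded newline, so switching to Text::CSV_XS with binary mode, as suggested in the first reply, is likely the simpler route.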