My script will write a new file that is exactly the same as the input file:

    this is field 1,this is field 2, this is field 3

Now, guess what: if I change the ASCII letter 'e' to e with acute (U+00E9), it still works the same as with plain ASCII. That is, the output file is exactly the same as the input file. The output file created with my script has no double quotes around any field; it looks the same as the input file:

    this is fiéld 1,this is fiéld 2, this is fiéld 3

But if I add one Japanese character to any one of the fields, the field with the Japanese character will have double quotes around it in the new file. So the inconsistency is even worse than I expected: quote_space => 0 does work for some characters above 0x7F but not for all of them. My data file is UTF-8, so the e acute is two bytes in UTF-8 whereas the Japanese character is three bytes, but I'd like to think that all UTF-8 data is treated the same. In conclusion, I want properly formatted CSV, and I expect double quotes around strings only when needed, but my testing shows a lot of inconsistency in how quote_space => 0 behaves depending on the characters in the string.
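To double-check the byte counts mentioned above, here is a small pure-Perl snippet (using the core Encode module) showing that e acute is two bytes in UTF-8 while a typical Japanese character is three:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Characters from the examples above, as Perl character strings.
my $e_acute  = "\x{00E9}";   # é  (Latin-1 Supplement)
my $japanese = "\x{65E5}";   # 日 (a CJK character)

# length() on the encoded byte string gives the UTF-8 byte count.
printf "e acute:  %d bytes\n", length encode('UTF-8', $e_acute);   # 2 bytes
printf "japanese: %d bytes\n", length encode('UTF-8', $japanese);  # 3 bytes
```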
Did some more testing, and it appears that quote_space => 0 works properly only for printable characters in the range 0x00 - 0xFF (Basic Latin and Latin-1 Supplement). Fields containing a character at or above 0x0100 (where Latin Extended-A starts) always get double quotes around them, whether needed or not. So the quote_space => 0 option stops working as expected for characters starting at 0x0100.
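Here is a minimal sketch of the test I'm describing. The second half is an assumption on my part, not something the docs spell out for this exact case: Text::CSV_XS also has a quote_binary attribute, and setting quote_binary => 0 may be what disables quoting for those higher characters as well, giving consistent unquoted output:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# What I observe with quote_space => 0 alone: the field containing
# the U+65E5 character (>= 0x0100) comes out quoted, the é field does not.
my $csv = Text::CSV_XS->new({ binary => 1, quote_space => 0 });
$csv->combine("this is fi\x{E9}ld 1", "this is fi\x{65E5}ld 2");
print $csv->string, "\n";

# Assumption: adding quote_binary => 0 also turns off quoting for
# "binary" (high) characters, so no field gets quoted at all.
my $csv2 = Text::CSV_XS->new({ binary => 1, quote_space => 0, quote_binary => 0 });
$csv2->combine("this is fi\x{E9}ld 1", "this is fi\x{65E5}ld 2");
print $csv2->string, "\n";
```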
In reply to Re^4: CSV_XS and UTF8 strings
by beerman
in thread CSV_XS and UTF8 strings
by beerman