    $ perl -e 'print substr("a\x{2322}bcd", 0, 3), "\n";' | hexdump
    00000000 61 e2 8c a2 62 0a

This does not:

    $ perl -e 'print "a\x{2322}bcd\n"' > uni-file
    $ perl -ne 'print substr($_,0,3), "\n"' uni-file | hexdump
    00000000 61 e2 8c 0a

Neither does this:

    $ perl -ne 'utf8::upgrade($_);
    > print substr($_,0,3), "\n"' uni-file |
    > hexdump
    00000000 61 e2 8c 0a

But this does:

    $ perl -ne 'BEGIN{binmode(STDIN,":utf8")}
    > print substr($_,0,3), "\n"' < uni-file |
    > hexdump
    00000000 61 e2 8c a2 62 0a

However, I can't use binmode in my program. In my program I use IO::File and Text::CSV_XS to read a cp1252-encoded file. The line

    $io->open($in_file, "<:raw:encoding(cp1252)") || die(...);

does indeed read the input and converts the octet string into utf8, as expected. A few lines later, I need to substr() a column:

    while (!$io->eof) {
        $cols = $csvin->getline($io);
        ...
        $str = $cols->[0];
        print substr($str,0,3), "\n"; # XXX cuts through utf8 multibyte!
    }

but here the string is not recognized as utf8. The substr function cuts right in the middle of a utf8 multibyte sequence. Even adding "utf8::upgrade($str)" doesn't help. There must be something obvious I'm missing here...
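For reference, the behaviour seen in the one-liners can be reproduced in a small standalone script. This is only a sketch of the byte-versus-character distinction, assuming the input arrives as undecoded UTF-8 octets; the variable names are made up for illustration, and it does not involve IO::File or Text::CSV_XS, so it may not match what happens in the real program.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Octets as they would arrive from a handle without an encoding layer:
# "a", then U+2322 encoded as UTF-8 (e2 8c a2), then "bcd".
my $octets = "a\xe2\x8c\xa2bcd";

# On the undecoded string every byte counts as one character, so a
# length of 3 stops in the middle of the multibyte sequence.
my $byte_cut = substr($octets, 0, 3);

# After decoding, substr counts code points instead of bytes.
my $chars    = decode('UTF-8', $octets);
my $char_cut = substr($chars, 0, 3);

# Hex-dump helper (hypothetical name), takes a string of octets.
sub hex_bytes { join ' ', map { sprintf '%02x', ord($_) } split //, $_[0] }

print "bytes: ", hex_bytes($byte_cut), "\n";
print "chars: ", hex_bytes(encode('UTF-8', $char_cut)), "\n";
```

As far as I understand it, utf8::upgrade() only changes Perl's internal storage of the string and leaves its characters as they are, so calling it on undecoded octets still leaves one character per byte. That would be consistent with the third one-liner above.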
In reply to substr on utf8-strings by Anonymous Monk