Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
This works:

    $ perl -e 'print substr("a\x{2322}bcd", 0, 3), "\n";' | hexdump
    00000000 61 e2 8c a2 62 0a

This does not:

    $ perl -e 'print "a\x{2322}bcd\n"' > uni-file
    $ perl -ne 'print substr($_,0,3), "\n"' uni-file | hexdump
    00000000 61 e2 8c 0a

Neither does this:

    $ perl -ne 'utf8::upgrade($_);
    >   print substr($_,0,3), "\n"' uni-file |
    >   hexdump
    00000000 61 e2 8c 0a

But this does:

    $ perl -ne 'BEGIN{binmode(STDIN,":utf8")}
    >   print substr($_,0,3), "\n"' uni-file |
    >   hexdump
    00000000 61 e2 8c a2 62 0a

However, I can't use binmode in my program. In my program I use IO::File and Text::CSV_XS to read a file encoded in cp1252. The line:

    $io->open($in_file, "<:raw:encoding(cp1252)") || die(...);

does indeed read the input and converts the octet string into utf8, as expected. A few lines later, I need to substr() a column:

    while (!$io->eof) {
        $cols = $csvin->getline($io);
        ...
        $str = $cols->[0];
        print substr($str,0,3), "\n";   # XXX cuts through utf8 multibyte!
    }

but here the string is not recognized as utf8. The substr function cuts right through the middle of a utf8 multibyte sequence. Even adding "utf8::upgrade($str)" doesn't help. There must be something obvious I'm missing here...
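
For readers following along, here is a minimal sketch of one way the symptoms above can arise. It assumes (the post does not confirm this) that the field handed back by the CSV parser holds UTF-8 encoded octets rather than a decoded character string. The helper name first_chars_of_utf8_octets is made up for illustration. utf8::upgrade only changes Perl's internal storage and leaves the string at the same length, which would explain why it makes no difference; an explicit decode from the core Encode module is what turns the octets into characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take the first $n characters of a string that holds
# UTF-8 encoded octets, and return the result as UTF-8 octets again.
sub first_chars_of_utf8_octets {
    my ($octets, $n) = @_;
    my $chars = decode('UTF-8', $octets);    # octets -> character string
    return encode('UTF-8', substr($chars, 0, $n));
}

# The same bytes the one-liners wrote to uni-file: "a", U+2322, "bcd".
my $octets = "a\xe2\x8c\xa2bcd";

# substr on the undecoded string counts bytes and splits the character.
print unpack('H*', substr($octets, 0, 3)), "\n";                    # 61e28c

# utf8::upgrade changes only the internal representation: the string still
# has 7 elements, so substr cuts at the same place as before.
my $upgraded = $octets;
utf8::upgrade($upgraded);
print length($upgraded), "\n";                                      # 7

# Decoding first makes substr count characters instead.
print unpack('H*', first_chars_of_utf8_octets($octets, 3)), "\n";   # 61e28ca262
```

If the assumption holds, decoding $str right after getline() and before the substr() call should keep the multibyte character whole. Whether Text::CSV_XS is really where the character semantics get lost is something to check in your own setup.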
Re: substr on utf8-strings
by pg (Canon) on Dec 24, 2003 at 20:43 UTC
by Anonymous Monk on Dec 25, 2003 at 23:37 UTC

Re: substr on utf8-strings
by ysth (Canon) on Dec 24, 2003 at 18:01 UTC

Re: substr on utf8-strings
by Roy Johnson (Monsignor) on Dec 24, 2003 at 15:57 UTC