substr on utf8-strings

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Perl 5.8.2. I have trouble getting substr to recognize utf8 input. This works:

$ perl -e 'print substr("a\x{2322}bcd", 0, 3), "\n";' | hexdump
00000000  61 e2 8c a2 62 0a

This does not:

$ perl -e 'print "a\x{2322}bcd\n"' > uni-file
$ perl -ne 'print substr($_,0,3), "\n"' uni-file | hexdump
00000000  61 e2 8c 0a

Neither does this:

$ perl -ne 'utf8::upgrade($_);
> print substr($_,0,3), "\n"' uni-file |
> hexdump
00000000  61 e2 8c 0a

But this does:

$ perl -ne 'BEGIN{binmode(STDIN,":utf8")}
> print substr($_,0,3), "\n"' uni-file |
> hexdump
00000000  61 e2 8c a2 62 0a

However, I can't use binmode in my program, which uses IO::File and Text::CSV_XS to read a file encoded in cp1252. The line:

$io->open($in_file, "<:raw:encoding(cp1252)")  ||  die(...);

does indeed read the input, and converts the octet string into utf8, as expected. A few lines later, I need to substr() a column:

while (!$io->eof) {
   $cols = $csvin->getline($io);
   ...
   $str = $cols->[0];
   print substr($str,0,3), "\n";   # XXX cuts through utf8 multibyte!
}

but here the string is not recognized as utf8. The substr function cuts right through the middle of a multibyte UTF-8 sequence. Even adding "utf8::upgrade($str)" doesn't help. There must be something obvious I'm missing here...
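
A minimal sketch of why utf8::upgrade may not help here, assuming the string holds UTF-8 encoded octets that were never decoded. utf8::upgrade only changes Perl's internal storage and leaves every octet as its own character, while utf8::decode reinterprets the octets as UTF-8. The variable names are illustrative only.

```perl
use strict;
use warnings;

# UTF-8 encoded octets for "a", U+2322, "bcd", as read without any layer.
my $bytes = "a\xe2\x8c\xa2bcd";

# utf8::upgrade changes the internal representation only: the string
# still consists of 7 characters, one per original octet.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# utf8::decode treats the octets as UTF-8 and turns them into 5 characters.
my $decoded = $bytes;
utf8::decode($decoded);

# Print the length and the code points of the first three characters.
for my $pair (["upgraded", $upgraded], ["decoded", $decoded]) {
    my ($label, $str) = @$pair;
    my @points = map { sprintf("U+%04X", ord($_)) } split(//, substr($str, 0, 3));
    printf "%-8s length %d, substr(0,3) = %s\n", $label, length($str), "@points";
}
```

With the upgraded copy, substr still returns the characters 0x61, 0xE2 and 0x8C, which is the truncated sequence seen in the hexdump above. With the decoded copy it returns a, U+2322 and b.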


Re: substr on utf8-strings by pg (Canon) on Dec 24, 2003 at 20:43 UTC
I don't think there is any bug, and in fact your code looks suspicious. It does not seem to me that you have the utf8 layer turned on for your files. Your mention of binmode does not fit the context at all: what does binmode have to do with what you are attempting? I hope these small code samples help you understand how Perl handles Unicode file I/O.

The first example works as expected:

use strict;
use warnings;

open(FILE, ">:utf8", "a.txt");
print FILE "a\x{2322}bcd\n";
close FILE;

open(FILE, "<:utf8", "a.txt");
while (my $line = <FILE>) {
    print substr($line, 0, 3);
}
close FILE;

This second example does not work, and it should not work. The only difference is that the files are opened without :utf8:

use strict;
use warnings;

open(FILE, ">", "a.txt");
print FILE "a\x{2322}bcd\n";
close FILE;

open(FILE, "<", "a.txt");
while (my $line = <FILE>) {
    print substr($line, 0, 3);
}
close FILE;
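
The same contrast can be sketched without touching the disk, assuming an in-memory filehandle behaves like the file a.txt for this purpose. The helper first_three is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# The UTF-8 encoded octets that the examples above would find in a.txt.
my $octets = "a\xe2\x8c\xa2bcd\n";

# Hypothetical helper: read the first line through the given open mode
# (including any PerlIO layer) and return its first three characters.
sub first_three {
    my ($mode) = @_;
    open(my $fh, $mode, \$octets) or die "cannot open in-memory file: $!";
    my $line = <$fh>;
    close $fh;
    return substr($line, 0, 3);
}

my $with_layer    = first_three("<:utf8");  # decoded characters
my $without_layer = first_three("<");       # raw octets

# Show the code points so the result does not depend on the terminal.
for my $pair (["<:utf8", $with_layer], ["<", $without_layer]) {
    my ($mode, $str) = @$pair;
    my @points = map { sprintf("U+%04X", ord($_)) } split(//, $str);
    print "$mode gives @points\n";
}
```

The :utf8 layer trusts its input without validating it. Where the input may be malformed, :encoding(UTF-8) is the stricter choice.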
Re: Re: substr on utf8-strings by Anonymous Monk on Dec 25, 2003 at 23:37 UTC
Just learning the new Unicode support in Perl (jumping from 5.6.1 to 5.8.2), I was indeed confused by the concepts. Investigating further does reveal a bug, I believe, but it could also be another misunderstanding. Below is the shortest program that shows the behaviour. I do think it is a bug in the Text::CSV_XS module together with perl 5.8.2.

If I run the program without options, using simple IO::File methods, the string is recognized as utf8. Running it with '-c', using CSV_XS, the string is converted correctly, but not recognized or marked as utf8. Running it with '-cf' to force the utf8 flag on results in decent behaviour again.

The program expects as input a little file encoded in latin1 or windows-1252. I tested with a file containing e-acute/e-acute/n/newline.

#!/usr/bin/perl
# expect as input file named "file" with these contents:
# $ hexdump file
# 00000000  e9 e9 6e 0a

use Getopt::Std;
use IO::File;
use Text::CSV_XS;
use Encode;

binmode(STDOUT, ":utf8");

# -c = use CSV_XS instead of simple IO
# -f = force utf8 flag on
getopts('cf');

my $io = IO::File->new();
$io->open("file", "<:raw:encoding(cp1252)") || die("$0: open inputfile: $!\n");
my $csvin = Text::CSV_XS->new({ 'binary' => 1 });

if ($opt_c) {
    $cols = $csvin->getline($io);
    $s = $$cols[0];
} else {
    $s = <$io>;
    chomp($s);
}
Encode::_utf8_on($s) if ($opt_f);

print "[$s] ",
    utf8::is_utf8($s) ? "Is UTF8" : "NOT utf8", " ",
    utf8::valid($s) ? "Is valid" : "NOT valid";
print "\nLENGTH: ", length($s),
    " SUBSTR3: [", substr($s,0,3), "]",
    " SPLIT: [", join(" ", split(//, $s)), "]\n";
Re: substr on utf8-strings by ysth (Canon) on Dec 24, 2003 at 18:01 UTC
From your examples, this isn't a problem with substr but with making sure the utf8 flag is on the data. I suspect some XS code (either the getline or the ...) is not setting the flag correctly. (What class is that getline in? What class is $io?) You can put Encode::is_utf8($scalar) checks through your code to figure out where it's being lost, and if necessary, do Encode::_utf8_on.

Update: the suggestion of using _utf8_on is a temporary workaround and is not a substitute for reporting a bug to the author if a module is not correctly handling UTF-8 input.
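
A hedged sketch of that kind of check, which decodes the data with Encode::decode rather than forcing the flag on. It assumes that any field arriving without the utf8 flag holds UTF-8 encoded octets, which is a guess about the module's behaviour and not something confirmed in this thread. The helper decode_fields and the sample row are hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the fields as character strings.
# Fields already flagged as utf8 pass through unchanged; all others are
# ASSUMED to be UTF-8 octets and are decoded.
sub decode_fields {
    my @fields = @_;
    return map {
        Encode::is_utf8($_) ? $_ : Encode::decode('UTF-8', $_)
    } @fields;
}

# Stand-in for a parsed row that came back as raw octets:
# e-acute, e-acute, n encoded as UTF-8, plus a plain ASCII field.
my @row     = ("\xc3\xa9\xc3\xa9n", "plain");
my @decoded = decode_fields(@row);

# Report the length of each field before and after decoding.
for my $i (0 .. $#row) {
    printf "field %d: %d octets in, %d characters out\n",
        $i, length($row[$i]), length($decoded[$i]);
}
```

Unlike _utf8_on, Encode::decode checks the octets and by default replaces malformed sequences with U+FFFD, so invalid input cannot produce a corrupt internal string. The is_utf8 test is only a heuristic: an unflagged string may also hold genuine Latin-1 characters, in which case decoding it would be wrong.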
Re: substr on utf8-strings by Roy Johnson (Monsignor) on Dec 24, 2003 at 15:57 UTC
I'm thinking bug. This little program works under ActivePerl 5.8.1.

#!perl
use strict;
use warnings;

$_ = "a\x{2322}bcd\n";
print substr($_, 0, 3), "\n";

output is:

Wide character in print at try2.pl.txt line 6.
aΓîób

Or at least that's as close as PerlMonks can display. The important thing is that it prints from a to b.

The PerlMonk tr/// Advocate