in reply to substr on utf8-strings

I don't think there is any bug, and in fact your code looks very suspicious. It does not seem to me that you have the :utf8 layer turned on for your files. The mention of your attempt to open the file with binmode simply does not fit the context at all. What does binmode have to do with what you are attempting?

I hope these little code samples help you understand how Perl handles Unicode file I/O. The first example works as expected:

use strict;
use warnings;

open(FILE, ">:utf8", "a.txt");
print FILE "a\x{2322}bcd\n";
close FILE;

open(FILE, "<:utf8", "a.txt");
while (my $line = <FILE>) {
    print substr($line, 0, 3);
}
close FILE;

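This second example does not work, and it should not work. The only difference here is that the files are opened without :utf8, so the line is read back as bytes and substr cuts through the middle of the three-byte UTF-8 sequence for \x{2322}: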

use strict;
use warnings;

open(FILE, ">", "a.txt");
print FILE "a\x{2322}bcd\n";
close FILE;

open(FILE, "<", "a.txt");
while (my $line = <FILE>) {
    print substr($line, 0, 3);
}
close FILE;
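To make the difference between the two examples visible without depending on the terminal, here is a minimal sketch that writes the same line to a temporary file and reads it back twice, once through a decoding layer and once as raw octets. The temporary file, the lexical filehandles, the `:encoding(UTF-8)` layer and the `hex_dump` helper are my own choices for illustration, not part of the examples above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: show each element of a string as a hex code.
sub hex_dump {
    my ($str) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $str;
}

# Scratch file that is removed when the program exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write one line containing U+2322 through an encoding layer.
open(my $out, '>:encoding(UTF-8)', $path) or die "write $path: $!";
print {$out} "a\x{2322}bcd\n";
close $out;

# Read it back as characters.
open(my $in_chars, '<:encoding(UTF-8)', $path) or die "read $path: $!";
my $chars = <$in_chars>;
close $in_chars;

# Read it back as raw octets, with no decoding at all.
open(my $in_raw, '<:raw', $path) or die "read $path: $!";
my $octets = <$in_raw>;
close $in_raw;

my $char_prefix = hex_dump(substr($chars, 0, 3));
my $byte_prefix = hex_dump(substr($octets, 0, 3));

printf "decoded: length %d, first three: %s\n", length($chars),  $char_prefix;
printf "raw:     length %d, first three: %s\n", length($octets), $byte_prefix;
```

With the decoding layer, substr returns three whole characters (61 2322 62). Without it, the same call returns 'a' plus the first two octets of the encoded character (61 E2 8C), which is not valid UTF-8 on its own.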

Re: Re: substr on utf8-strings
by Anonymous Monk on Dec 25, 2003 at 23:37 UTC
    Just learning the new Unicode support in Perl (jumping from 5.6.1 to 5.8.2), I was indeed confused by the concepts. Investigating further does reveal a bug, I believe, but it could also be another misunderstanding.

    Below is the shortest program that shows the behaviour. I do think it is a bug in the Text::CSV_XS module together with perl 5.8.2.

    If I run the program without options, using simple IO::File methods, the string is recognized as utf8. Running it with '-c', using CSV_XS, the string is converted correctly but is not recognized or marked as utf8. Running it with '-cf', to force the utf8 flag on, results in decent behaviour again.

    The program expects as input a little file encoded in latin1 or windows-1252. I tested with a file with e-acute/e-acute/n/newline.

    #!/usr/bin/perl
    # expect as input file named "file" with these contents:
    # $ hexdump file
    # 00000000 e9 e9 6e 0a
    use Getopt::Std;
    use IO::File;
    use Text::CSV_XS;
    use Encode;

    binmode(STDOUT, ":utf8");

    # -c = use CSV_XS instead of simple IO
    # -f = force utf8 flag on
    getopts('cf');

    my $io = IO::File->new();
    $io->open("file", "<:raw:encoding(cp1252)") || die("$0: open inputfile: $!\n");
    my $csvin = Text::CSV_XS->new({ 'binary' => 1 });

    if ($opt_c) {
        $cols = $csvin->getline($io);
        $s = $$cols[0];
    } else {
        $s = <$io>;
        chomp($s);
    }

    Encode::_utf8_on($s) if ($opt_f);

    print "[$s] ",
        utf8::is_utf8($s) ? "Is UTF8" : "NOT utf8", " ",
        utf8::valid($s) ? "Is valid" : "NOT valid";
    print "\nLENGTH: ", length($s),
        " SUBSTR3: [", substr($s,0,3), "]",
        " SPLIT: [", join(" ", split(//, $s)), "]\n";
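The '-c' symptom can be reproduced without Text::CSV_XS, which may help to separate the module from the concepts. The sketch below assumes that the field comes back as UTF-8 octets with the utf8 flag off; the '-c' output suggests this, but it is not verified against the module here. The variable names are illustrative only. It compares that flag-off string with the result of `Encode::decode`, which validates the octets instead of just flipping the internal flag the way `Encode::_utf8_on` does.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# What the plain IO::File path should yield: the cp1252 octets
# e9 e9 6e decoded into three characters.
my $expected = decode('cp1252', "\xe9\xe9n");

# Assumed state of the CSV_XS field: the same text as UTF-8 octets,
# with the utf8 flag off, so Perl counts octets, not characters.
my $octets         = encode('UTF-8', $expected);
my $octets_length  = length($octets);
my $octets_flagged = utf8::is_utf8($octets) ? 1 : 0;

# Decoding turns the octets back into a character string.
my $repaired         = decode('UTF-8', $octets);
my $repaired_length  = length($repaired);
my $repaired_flagged = utf8::is_utf8($repaired) ? 1 : 0;

printf "octets:   length %d, utf8 flag %s\n",
    $octets_length, $octets_flagged ? 'on' : 'off';
printf "repaired: length %d, utf8 flag %s\n",
    $repaired_length, $repaired_flagged ? 'on' : 'off';
printf "repaired equals expected: %s\n",
    $repaired eq $expected ? 'yes' : 'no';
```

If the assumption holds, decoding the field this way should give the same result as the '-cf' run, without relying on an internal function.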