in reply to Setting UTF-8 mode on filehandle reads?

If you were using Perl 5.8, I'd suggest pushing an encoding layer when you opened the file (or after with binmode). As you're not, I won't.

Here's a quick script that reads a file line-by-line and uses pack to set the UTF-8 flag on each string read in. After that flag is set, character semantics work as expected for wide characters that were read in from the file:

    use utf8;
    use CGI::Carp qw(fatalsToBrowser);

    print "Content-type: text/html; charset=utf-8\n\n";

    open(FILE, "<", "/path/to/utf8/file.txt") || die "$!";

    print "<pre>\n";
    while (<FILE>) {
        chomp;
        $_ = set_utf($_);
        my $len = length($_);   # count of chars not bytes
        print "$_", ' ' x (72 - $len), "|\n";
    }
    print "</pre>\n";

    sub set_utf {
        return pack "U0a*", join '', @_;
    }

I fashioned the script as a CGI script so that you can view the output in your browser, which understands UTF-8 characters (whereas your TTY might not). Given a UTF-8 text file whose lines are shorter than 72 characters, this should pad each line out to 72 characters with spaces and then append a '|'. If character semantics are not in force, length will count bytes rather than characters, too few spaces will be added to any line containing multi-byte characters, and the '|'s won't line up.
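
For anyone who wants to see the byte-versus-character difference in isolation, here's a minimal sketch. Note the assumptions: it uses Encode::decode, which ships with Perl 5.8 and later, instead of the pack trick above (so it's no help on 5.6.1), and pad_line is a hypothetical helper made up for illustration. The width of 10 is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: pad a line to $width with spaces and append '|'.
# length() counts characters if the string is decoded, bytes if it is not.
sub pad_line {
    my ($line, $width) = @_;
    my $len = length $line;
    my $pad = $width > $len ? $width - $len : 0;
    return $line . (' ' x $pad) . '|';
}

my $bytes = "caf\xC3\xA9";             # "café" as five raw UTF-8 octets
my $chars = decode('UTF-8', $bytes);   # same text as four characters

printf "length as bytes: %d\n", length $bytes;   # 5
printf "length as chars: %d\n", length $chars;   # 4

# Count the padding spaces each version receives.
my $spaces_bytes = (pad_line($bytes, 10) =~ tr/ //);
my $spaces_chars = (pad_line($chars, 10) =~ tr/ //);
printf "padding with byte semantics: %d spaces\n", $spaces_bytes;   # 5
printf "padding with char semantics: %d spaces\n", $spaces_chars;   # 6
```

The undecoded string gets one space too few, which is exactly why the '|'s drift out of line when the browser renders the two octets as a single character.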

Re:^2 Setting UTF-8 mode on filehandle reads?
by ph0enix (Friar) on Dec 06, 2002 at 12:32 UTC
     open(FILE, "<", "/path/to/utf8/file.txt") || die "$!";

    Why not use Perl 5.8.0 and simply specify an I/O layer when using the three-argument form of open?

    open(FILE, '<:utf8', '/path/to/utf8/file.txt') || die $!;
      Why not use Perl 5.8.0...

      um, because the original poster was looking for a 5.6.1 solution. (And if you actually read my reply you'd see that I suggested exactly what you propose in the first line!)