Re: Setting UTF-8 mode on filehandle reads?

If you were using Perl 5.8, I'd suggest pushing an encoding layer when you opened the file (or after with binmode). As you're not, I won't.

Here's a quick script that reads a file line-by-line and uses pack to set the UTF-8 flag on each string read in. After that flag is set, character semantics work as expected for wide characters that were read in from the file:

  use utf8;
  use CGI::Carp qw(fatalsToBrowser);

  print "Content-type: text/html; charset=utf-8\n\n";

  open(FILE, "<", "/path/to/utf8/file.txt") || die "$!";

  print "<pre>\n";
  while(<FILE>) {
    chomp;
    $_ = set_utf($_);
    my $len = length($_);    # count of chars not bytes
    print "$_", ' ' x (72 - $len), "|\n";
  }
  print "</pre>\n";

  sub set_utf {
    return pack "U0a*", join '', @_;
  }

I fashioned the script as a CGI script so that you can view the output in your browser, which understands UTF-8 characters (whereas your TTY might not). Given a UTF-8 text file with lines shorter than 72 characters, this should pad each line out to 72 characters with spaces and then append a '|'. If character semantics are not in force, length will count bytes rather than characters and the '|'s won't line up.
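
To see the two counts side by side without needing a file, here is a minimal sketch. The sample string and variable names are made up for illustration; it relies only on the core bytes pragma (available from Perl 5.6 on) to force byte semantics inside one block:

```perl
use strict;
use warnings;

# Hypothetical sample line: "caf" plus U+263A, which is one character
# but three bytes when encoded as UTF-8.
my $line = "caf\x{263A}";

# Character semantics: length counts characters.
my $chars = length($line);

# Byte semantics: inside this block length counts the underlying bytes.
my $bytes;
{
    use bytes;
    $bytes = length($line);
}

# Padding needed to reach column 72 under each count.
my $pad_chars = 72 - $chars;
my $pad_bytes = 72 - $bytes;

print "chars=$chars bytes=$bytes\n";
print "pad_chars=$pad_chars pad_bytes=$pad_bytes\n";
```

The byte count comes out two higher than the character count for this string, so padding computed from it is two spaces short. That is the drift you would see in the '|' column on any line containing wide characters.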


Re:^2 Setting UTF-8 mode on filehandle reads? by ph0enix (Friar) on Dec 06, 2002 at 12:32 UTC
  open(FILE, "<", "/path/to/utf8/file.txt") || die "$!";

Why don't use perl 5.8.0 and simply specify IO 'layer' when using three-argument form of open?

  open(FILE, '<:utf8', '/path/to/utf8/file.txt') || die $!;
Re: Re:^2 Setting UTF-8 mode on filehandle reads? by grantm (Parson) on Dec 06, 2002 at 18:14 UTC
"Why don't use perl 5.8.0..." Um, because the original poster was looking for a 5.6.1 solution. (And if you actually read my reply, you'd see that I suggested exactly what you propose in the first line!)