Trying to understand behavior of split and perl in general with UTF-8

hdv.jadev has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

After a hiatus of some years (!) I've returned to my programming roots, i.e. perl. To help someone out with a large number of texts containing non-ASCII characters I am writing a little script (I thought it would be easy...). However, it seems I no longer really understand perl's way of working with UTF-8. I've made a simplified example illustrating the problem:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;

my %german_chars = (
  'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue', 'ä' => 'ae',
  'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss'
);

#For each argument apply the conversions selected
for my $old_name (@ARGV) {
  my $new_name = do_german($old_name);
  print "Old name: '$old_name'\nNew name: '$new_name'\n";
}

#Replace German characters with their ASCII equivalent
sub do_german {
  my $string = shift;  
  #Replace all German characters in the given string
  my @string_array = split //, $string;
  foreach (@string_array) {
    print "###Before replacement### $_\n";
    $_ = $german_chars{$_} if $german_chars{$_};
    print "###After replacement ### $_\n";
  }
  $string = join '',@string_array;
  return $string;
}

When I run this, this is what comes out:

./test.pl für_elise
###Before replacement### f
###After replacement ### f
###Before replacement### &#65533;
###After replacement ### &#65533;
###Before replacement### &#65533;
###After replacement ### &#65533;
###Before replacement### r
###After replacement ### r
###Before replacement### _
###After replacement ### _
###Before replacement### e
###After replacement ### e
###Before replacement### l
###After replacement ### l
###Before replacement### i
###After replacement ### i
###Before replacement### s
###After replacement ### s
###Before replacement### e
###After replacement ### e
Old name: 'für_elise'
New name: 'für_elise'

Clearly what I wanted was "fuer_elise".

I know about split and character length, so I do understand why split turns the u umlaut into 2 separate characters. I've been playing around with "use Encoding", "use open" and "binmode", in an attempt to solve this, but I still do not really grasp what I am doing wrong. Especially because outside of the subroutine the "special" character is interpreted correctly.

I know I could use some module, but that wouldn't help me fill in the gap in my knowledge. Could one of you please enlighten me?

Your help is appreciated. Regards, Hans


Re: Trying to understand behavior of split and perl in general with UTF-8 by Yary (Pilgrim) on Jun 17, 2010 at 18:29 UTC
You're on the right track. "use utf8" tells perl that you have UTF-8 characters in your source. You do need "use Encode", and then you also need to decode what you read and encode what you write. Follow up with these commands: `perldoc perluniintro` and `perldoc perlunitut`.
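To make "decode what you read, encode what you write" concrete, here is a minimal sketch using the core Encode module. The variable names are only illustrative, and the input is written with hex escapes so the example does not depend on how the file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "für" as the raw UTF-8 bytes a UTF-8 shell would typically pass in
my $bytes = "f\xC3\xBCr";

# Decode on the way in: bytes become characters
my $chars = decode('UTF-8', $bytes);

# Encode on the way out: characters become bytes again
my $round_trip = encode('UTF-8', $chars);

# The byte string has length 4, the character string has length 3
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

Everything between the decode and the encode then works on whole characters, which is what split, length and hash lookups need.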
Re^2: Trying to understand behavior of split and perl in general with UTF-8 by hdv.jadev (Novice) on Jun 17, 2010 at 20:20 UTC
Ah, I think I know where I went off the cliff now. "use utf8" is only for the source itself, not for any character streams coming in or going out. For that I have to explicitly state the encoding. It seems I did not truly realize that perl has an internal way of using UTF-8 that has nothing to do with what encoding the shell is using. Am I correct in this assumption? With what you and almut told me I got the example code working in a few seconds, so that gives me some hope I do understand at least somewhat better now. Thanks very much for helping me renew my mastery of perl!
Re^3: Trying to understand behavior of split and perl in general with UTF-8 by almut (Canon) on Jun 17, 2010 at 20:42 UTC
"It seems I did not truly realize that perl has an internal way of using UTF-8 that has nothing to do with what encoding the shell is using."

Perl's internal way of representing unicode characters is (almost) UTF-8, too. But for most practical purposes from the user perspective, it helps to ignore this implementation detail¹, and just properly decode your inputs and encode your outputs.

___
¹ other languages have chosen different internal formats for unicode strings, e.g. Python uses UCS-2 or UCS-4 (build-time option).
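As a rough illustration of why decoding matters for the original script, the sketch below runs the same `split //` and hash lookup on an undecoded and a decoded copy of "für". The one-entry lookup table and the variable names are simplifications for this example, not the original code.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "f\xC3\xBCr";              # undecoded, as it arrives in @ARGV
my $chars = decode('UTF-8', $bytes);   # decoded character string

my @byte_parts = split //, $bytes;     # 4 elements: the umlaut is cut in two
my @char_parts = split //, $chars;     # 3 elements: the umlaut stays whole

# One-entry stand-in for %german_chars, keyed on the character U+00FC
my %german_chars = ("\x{FC}" => 'ue');

# Count how many elements match a key in the table
my $hits_bytes = grep { exists $german_chars{$_} } @byte_parts;
my $hits_chars = grep { exists $german_chars{$_} } @char_parts;

print "undecoded: ", scalar(@byte_parts), " parts, $hits_bytes match(es)\n";
print "decoded:   ", scalar(@char_parts), " parts, $hits_chars match(es)\n";
```

Neither half of the undecoded umlaut equals the key, so the lookup in the original script never fires.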
Re^4: Trying to understand behavior of split and perl in general with UTF-8 by hdv.jadev (Novice) on Jun 18, 2010 at 11:02 UTC
Re^3: Trying to understand behavior of split and perl in general with UTF-8 by hdv.jadev (Novice) on Jul 22, 2010 at 12:16 UTC
Sorry it took so long to get back to this. Holidays... Anyway, just to let you know: yesterday the person I wrote the script for successfully converted about 2 GB worth of plain-text files with old research data to a new database format. Thanks to your help it all went smoothly.
Re: Trying to understand behavior of split and perl in general with UTF-8 by almut (Canon) on Jun 17, 2010 at 18:49 UTC
For one, you'd need to decode the UTF-8 encoded `@ARGV` input into Perl's internal unicode representation. Similarly, encode any unicode output back to UTF-8 for the terminal.

...
use Encode;
binmode STDOUT, ":utf8";  # for output

for my $old_name (map decode('UTF-8', $_), @ARGV) {
...

Instead of explicitly decoding `@ARGV` input, you could also use `-CA` (see perlrun). The problem with this option is, though, that you can't put it in the shebang line (error: `"Too late for "-CA" option at..."`), but only on the command line like `perl -CA yourscript.pl ...`

Update: actually, `#!/usr/bin/perl -CA` finally does seem to work, too, as of Perl 5.10.1.
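Putting the pieces together, one possible version of the complete script could look like the sketch below. It assumes the file is saved as UTF-8 and that the shell passes arguments as UTF-8. The `:encoding(UTF-8)` layer is used here as a stricter alternative to `:utf8`, and the condensed `do_german` is a rewrite for illustration, not the original poster's final code.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use utf8;                             # the source file itself is UTF-8
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';   # encode characters on output

my %german_chars = (
  'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue', 'ä' => 'ae',
  'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss'
);

# Decode each argument from UTF-8 bytes into characters before processing
for my $old_name (map { decode('UTF-8', $_) } @ARGV) {
  my $new_name = do_german($old_name);
  print "Old name: '$old_name'\nNew name: '$new_name'\n";
}

# Replace German characters with their ASCII equivalent
sub do_german {
  my $string = shift;
  # split now yields whole characters, so the hash lookup can match
  return join '',
    map { exists $german_chars{$_} ? $german_chars{$_} : $_ }
    split //, $string;
}
```

Run as `./test.pl für_elise`, the new name should come out as 'fuer_elise', provided the terminal really uses UTF-8.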