After a hiatus of some years (!) I've returned to my programming roots, i.e. Perl. To help someone out with a large number of texts containing non-ASCII characters, I am writing a little script (I thought it would be easy...). However, it seems I no longer really understand how Perl works with UTF-8. I've made a simplified example illustrating the problem:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

my %german_chars = (
    'Ä' => 'Ae',
    'Ö' => 'Oe',
    'Ü' => 'Ue',
    'ä' => 'ae',
    'ö' => 'oe',
    'ü' => 'ue',
    'ß' => 'ss'
);

#For each argument apply the conversions selected
for my $old_name (@ARGV) {
    my $new_name = do_german($old_name);
    print "Old name: '$old_name'\nNew name: '$new_name'\n";
}

#Replace German characters with their ASCII equivalent
sub do_german {
    my $string = shift;

    #Replace all German characters in the given string
    my @string_array = split //, $string;
    foreach (@string_array) {
        print "###Before replacement### $_\n";
        $_ = $german_chars{$_} if $german_chars{$_};
        print "###After replacement ### $_\n";
    }
    $string = join '',@string_array;

    return $string;
}

When I run this, this is what comes out:
./test.pl für_elise
###Before replacement### f
###After replacement ### f
###Before replacement### �
###After replacement ### �
###Before replacement### �
###After replacement ### �
###Before replacement### r
###After replacement ### r
###Before replacement### _
###After replacement ### _
###Before replacement### e
###After replacement ### e
###Before replacement### l
###After replacement ### l
###Before replacement### i
###After replacement ### i
###Before replacement### s
###After replacement ### s
###Before replacement### e
###After replacement ### e
Old name: 'für_elise'
New name: 'für_elise'
Clearly what I wanted was "fuer_elise".
I know about split and character length, so I do understand why split turns the u umlaut into 2 separate characters. I've been playing around with "use Encoding", "use open" and "binmode", in an attempt to solve this, but I still do not really grasp what I am doing wrong. Especially because outside of the subroutine the "special" character is interpreted correctly.
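To make the byte-versus-character distinction concrete, here is a minimal sketch of what might be happening. It assumes the shell hands the argument to the script as UTF-8 encoded bytes, and it simulates that with `encode` from the core `Encode` module instead of reading `@ARGV`. The variable names and the one-entry lookup table are made up for the illustration, and the file is assumed to be saved as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                        # literals in this file are decoded characters
use Encode qw(encode decode);

# One-entry lookup table, standing in for %german_chars above
my %lookup = ( 'ü' => 'ue' );

# Assumption: this is how the argument arrives, as undecoded UTF-8 bytes
my $raw = encode( 'UTF-8', 'für_elise' );

# Decoding turns the byte string into a character string
my $decoded = decode( 'UTF-8', $raw );

# Count how many split elements match a key of the lookup table
my $hits_raw     = grep { exists $lookup{$_} } split //, $raw;
my $hits_decoded = grep { exists $lookup{$_} } split //, $decoded;

printf "raw:     length %d, %d match(es)\n", length($raw),     $hits_raw;
printf "decoded: length %d, %d match(es)\n", length($decoded), $hits_decoded;
```

The raw string has length 10 and produces no match, because the two bytes of the umlaut are split into separate elements and neither equals the key 'ü'. The decoded string has length 9 and produces one match. If that assumption holds for the real script, the keys (decoded by `use utf8`) and the arguments (still bytes) are simply in different forms.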
I know I could use some module, but that wouldn't help me fill in the gap in my knowledge. Could one of you please enlighten me?
Your help is appreciated.

Regards, Hans