hdv.jadev has asked for the wisdom of the Perl Monks concerning the following question:
After a hiatus of some years (!) I've returned to my programming roots, i.e. perl. To help someone out with a large number of texts containing non-ASCII characters I am writing a little script (I thought it would be easy...). However, it seems I no longer really understand perl's way of working with UTF-8. I've made a simplified example illustrating the problem:
```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

my %german_chars = (
    'Ä' => 'Ae',
    'Ö' => 'Oe',
    'Ü' => 'Ue',
    'ä' => 'ae',
    'ö' => 'oe',
    'ü' => 'ue',
    'ß' => 'ss'
);

#For each argument apply the conversions selected
for my $old_name (@ARGV) {
    my $new_name = do_german($old_name);
    print "Old name: '$old_name'\nNew name: '$new_name'\n";
}

#Replace German characters with their ASCII equivalent
sub do_german {
    my $string = shift;

    #Replace all German characters in the given string
    my @string_array = split //, $string;
    foreach (@string_array) {
        print "###Before replacement### $_\n";
        $_ = $german_chars{$_} if $german_chars{$_};
        print "###After replacement ### $_\n";
    }
    $string = join '',@string_array;
    return $string;
}
```

When I run this, this is what comes out:

```
./test.pl für_elise
###Before replacement### f
###After replacement ### f
###Before replacement### �
###After replacement ### �
###Before replacement### �
###After replacement ### �
###Before replacement### r
###After replacement ### r
###Before replacement### _
###After replacement ### _
###Before replacement### e
###After replacement ### e
###Before replacement### l
###After replacement ### l
###Before replacement### i
###After replacement ### i
###Before replacement### s
###After replacement ### s
###Before replacement### e
###After replacement ### e
Old name: 'für_elise'
New name: 'für_elise'
```
Clearly what I wanted was "fuer_elise".
I know about split and character length, so I do understand why split turns the u umlaut into 2 separate characters. I've been playing around with "use Encoding", "use open" and "binmode", in an attempt to solve this, but I still do not really grasp what I am doing wrong. Especially because outside of the subroutine the "special" character is interpreted correctly.
I know I could use some module, but that wouldn't help me fill in the gap in my knowledge. Could one of you please enlighten me?
Your help is appreciated. Regards, Hans
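A minimal sketch of one likely explanation, assuming the shell passes the arguments as UTF-8 encoded bytes: `use utf8` only tells perl that the literals in the source file are UTF-8, so the hash keys are one-character strings, while the elements of `@ARGV` stay undecoded byte strings. The `ü` in the argument is then the two bytes `0xC3 0xBC`, which `split //` separates and which never match a hash key. Outside the subroutine those bytes are printed back unchanged, which is probably why the terminal still shows them as `ü`. Decoding the argument with `Encode::decode` before the lookup and setting an output layer with `binmode` makes both sides character strings. The name `$raw` is made up for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                    # the literals in this file are UTF-8
use Encode qw(decode);

# Encode everything printed to STDOUT as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';

my %german_chars = (
    'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue',
    'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue',
    'ß' => 'ss',
);

# Replace German characters with their ASCII equivalent.
# Expects a decoded character string, not raw UTF-8 bytes.
sub do_german {
    my ($string) = @_;
    return join '',
        map { exists $german_chars{$_} ? $german_chars{$_} : $_ }
        split //, $string;
}

for my $raw (@ARGV) {
    # Assumption: the shell hands over arguments as UTF-8 bytes.
    my $old_name = decode('UTF-8', $raw);
    my $new_name = do_german($old_name);
    print "Old name: '$old_name'\nNew name: '$new_name'\n";
}
```

Under that assumption, `./test.pl für_elise` should report `fuer_elise` as the new name. Starting perl with the `-CA` switch described in perlrun is another way to get `@ARGV` decoded, without calling `decode` by hand.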
Re: Trying to understand behavior of split and perl in general with UTF-8
by Yary (Pilgrim) on Jun 17, 2010 at 18:29 UTC
by hdv.jadev (Novice) on Jun 17, 2010 at 20:20 UTC
by almut (Canon) on Jun 17, 2010 at 20:42 UTC
by hdv.jadev (Novice) on Jun 18, 2010 at 11:02 UTC
by hdv.jadev (Novice) on Jul 22, 2010 at 12:16 UTC

Re: Trying to understand behavior of split and perl in general with UTF-8
by almut (Canon) on Jun 17, 2010 at 18:49 UTC