
Hi,

After a hiatus of some years (!) I've returned to my programming roots, i.e. Perl. To help someone out with a large number of texts containing non-ASCII characters, I am writing a little script (I thought it would be easy...). However, it seems I no longer really understand how Perl works with UTF-8. I've made a simplified example illustrating the problem:

#!/usr/bin/perl

use strict;
use warnings;
use utf8;

my %german_chars = (
  'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue', 'ä' => 'ae',
  'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss'
);

#For each argument apply the conversions selected
for my $old_name (@ARGV) {
  my $new_name = do_german($old_name);
  print "Old name: '$old_name'\nNew name: '$new_name'\n";
}

#Replace German characters with their ASCII equivalent
sub do_german {
  my $string = shift;  
  #Replace all German characters in the given string
  my @string_array = split //, $string;
  foreach (@string_array) {
    print "###Before replacement### $_\n";
    $_ = $german_chars{$_} if $german_chars{$_};
    print "###After replacement ### $_\n";
  }
  $string = join '',@string_array;
  return $string;
}

When I run this, this is what comes out:

./test.pl für_elise
###Before replacement### f
###After replacement ### f
###Before replacement### &#65533;
###After replacement ### &#65533;
###Before replacement### &#65533;
###After replacement ### &#65533;
###Before replacement### r
###After replacement ### r
###Before replacement### _
###After replacement ### _
###Before replacement### e
###After replacement ### e
###Before replacement### l
###After replacement ### l
###Before replacement### i
###After replacement ### i
###Before replacement### s
###After replacement ### s
###Before replacement### e
###After replacement ### e
Old name: 'für_elise'
New name: 'für_elise'

Clearly what I wanted was "fuer_elise".

I know about split and character length, so I do understand why split turns the u umlaut into 2 separate characters. I've been playing around with "use Encoding", "use open" and "binmode" in an attempt to solve this, but I still do not really grasp what I am doing wrong, especially because outside of the subroutine the "special" character is interpreted correctly.

I know I could use some module, but that wouldn't help me fill in the gap in my knowledge. Could one of you please enlighten me?

Your help is appreciated. Regards, Hans
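
For reference, a minimal sketch of one likely explanation, not a definitive fix. "use utf8" only tells perl that the source file itself (here, the hash keys) is UTF-8; the values in @ARGV still arrive as undecoded bytes, so split sees two bytes for the u umlaut and neither matches a one-character hash key. The name looks right outside the subroutine only because the same bytes are printed back unchanged. The sketch assumes the shell and terminal use UTF-8; the variable $raw and the explicit decode and binmode calls are additions that are not in the original script.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use utf8;                     # the source file (the hash keys) is UTF-8
use Encode qw(decode);

# Assumption: the terminal expects UTF-8, so encode characters on output.
binmode STDOUT, ':encoding(UTF-8)';

my %german_chars = (
  'Ä' => 'Ae', 'Ö' => 'Oe', 'Ü' => 'Ue', 'ä' => 'ae',
  'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss'
);

# Replace German characters with their ASCII equivalent.
# Expects a decoded character string, not raw bytes.
sub do_german {
  my $string = shift;
  return join '',
    map { exists $german_chars{$_} ? $german_chars{$_} : $_ }
    split //, $string;
}

# Decode each argument from bytes to characters before processing it.
# Assumption: the shell passes arguments as UTF-8 encoded bytes.
for my $raw (@ARGV) {
  my $old_name = decode('UTF-8', $raw);
  my $new_name = do_german($old_name);
  print "Old name: '$old_name'\nNew name: '$new_name'\n";
}
```

Once the argument is decoded, split // yields one element per character, so the hash lookup can match. In a UTF-8 terminal, running this with für_elise should report the new name as fuer_elise. Running perl with the -CA switch (or -CSA to cover the standard streams as well) is another way to get @ARGV decoded without calling decode by hand.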

In reply to Trying to understand behavior of split and perl in general with UTF-8 by hdv.jadev
