jkeenan1 has asked for the wisdom of the Perl Monks concerning the following question:

I receive files with human names in ALL CAPS, including accented upper case characters. The files are in iso-8859-1.

    CLÉ USB
    CLÉMMY USB

I need to "proper-case" these words, i.e., Initial Cap All Other Characters Lower Case.

When I open these files in vi, I don't see the upper-case accented E; I see a question-mark.

    CL? USB
    CL?MMY USB

When I run these lines through Encode::from_to, I do get the upper-case accented E, but no matter what I do, I can't seem to lower-case it. In fact, the character that follows it stays upper-case as well.

    use strict;
    use warnings;
    use feature qw( :5.10 );
    use Data::Dumper; $Data::Dumper::Indent=1;
    use Carp;
    use Encode qw( from_to );
    use POSIX qw( setlocale LC_CTYPE );
    setlocale(LC_CTYPE, "fr_CA.ISO8859-1");

    my $file = q{./yard};
    open my $IN, '<', $file or croak;
    while (my $l = <$IN>) {
        chomp $l;
        say "1: $l";
        my $m = $l;
        from_to($m, "iso-8859-1", "utf8");
        say "2: $m";
        say "3: ", xlc($m);
    }
    close $IN or croak;

    sub xlc {
        my $str = shift;
        return join q{} => (
            map { ucfirst(lc($_)) } ( $str =~ m/(\W+|\w+)/g )
        );
    };

Results:

    1: CL? USB
    2: CLÉ USB
    3: ClÉ Usb
    1: CL?MMY USB
    2: CLÉMMY USB
    3: ClÉMmy Usb

What am I doing wrong?

Thank you very much.

Jim Keenan

Re: Unable to lc upper case accented characters
by Eliya (Vicar) on Feb 19, 2011 at 02:40 UTC

    Get rid of the locale stuff, and tell Perl what your input and (desired) output encodings are:

        #!/usr/bin/perl -w
        use strict;
        use feature qw( :5.10 );
        use Carp;

        my $file = q{./yard};
        open my $IN, '<:encoding(ISO8859-1)', $file or croak;
        binmode STDOUT, ":utf8";  # assuming your terminal is UTF-8

        while (my $l = <$IN>) {   # $l is now a Unicode/text string
            chomp $l;
            say "1: $l";
            say "2: ", xlc($l);
        }

        ... sub xlc as you have it

    Output:

        1: CLÉ USB
        2: Clé Usb
        1: CLÉMMY USB
        2: Clémmy Usb

    Using the PerlIO layer ":encoding(ISO8859-1)" with open tells Perl that your input is in ISO8859-1, and has the effect of decoding the data into Perl's internal Unicode format, so ucfirst etc. will work correctly.
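
    To see the layer at work without needing the ./yard file, here is a minimal sketch that reads from an in-memory filehandle instead. The byte string and the names $latin1_bytes and @proper are made up for illustration; the rest is plain open with an :encoding layer, as above.

```perl
use strict;
use warnings;

# Assumption: these bytes stand in for the contents of an ISO-8859-1 file.
# \xC9 is the single Latin-1 byte for the capital E with acute accent.
my $latin1_bytes = "CL\xC9 USB\nCL\xC9MMY USB\n";

# Open an in-memory filehandle on the bytes and decode them through the layer.
open my $in, '<:encoding(ISO-8859-1)', \$latin1_bytes
    or die "Cannot open in-memory file: $!";

binmode STDOUT, ':encoding(UTF-8)';    # assuming a UTF-8 terminal

my @proper;
while (my $line = <$in>) {
    chomp $line;
    # $line is a character string here, so \w, lc and ucfirst treat the
    # accented E as a letter rather than as raw bytes.
    my $cased = join '', map { ucfirst lc $_ } $line =~ m/(\w+|\W+)/g;
    push @proper, $cased;
    print "$cased\n";
}
close $in or die "Cannot close in-memory file: $!";
```

    On a UTF-8 terminal this should print the same two proper-cased lines as the output shown above.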

    The general idea is to decode on input, and encode on output. Perl will then (ideally) do the rest.
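
    The same idea can be spelled out by hand with Encode's decode and encode, which also shows where the original script went wrong. This is only a sketch: proper_case is a stand-in for your xlc, and the sample bytes and variable names are invented for the example. It relies on Perl's default byte semantics (no use locale, no unicode_strings feature).

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Stand-in for xlc: capitalise the first letter of each word, lower-case the rest.
sub proper_case {
    my ($str) = @_;
    return join '', map { ucfirst lc $_ } $str =~ m/(\w+|\W+)/g;
}

# Assumption: raw ISO-8859-1 bytes, as read from a file opened without a layer.
my $raw = "CL\xC9MMY USB";

# The pitfall: from_to() leaves UTF-8 *bytes* behind. The two bytes of the
# accented E are not word characters, so they split the word, and lc leaves
# them untouched.
my $utf8_bytes = encode('UTF-8', decode('ISO-8859-1', $raw));
my $on_bytes   = proper_case($utf8_bytes);

# Decode on input: work on characters, where the accented E is one letter.
my $text     = decode('ISO-8859-1', $raw);
my $on_chars = proper_case($text);

# Encode on output: turn the characters back into bytes for the terminal.
my $out = encode('UTF-8', $on_chars);

print "$on_bytes\n";    # the broken result, as UTF-8 bytes
print "$out\n";         # the intended result, as UTF-8 bytes
```

    The first line printed reproduces the "ClÉMmy Usb" result from the question; the second is the proper-cased form.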

      Thanks, Eliya. That DWIMs.
      Jim Keenan
Re: Unable to lc upper case accented characters
by wind (Priest) on Feb 19, 2011 at 02:53 UTC

    It's not going to help you with your special character issue, but you should take a look at the CPAN module Lingua::EN::NameCase.

    It will help you fix the special names like MacDonald etc.

    - Miller

      wind++. Excellent recommendation!

          D:\>chcp
          Active code page: 1252

          D:\>type 889023.pl
          #!perl
          use strict;
          use warnings;
          use Lingua::EN::NameCase;

          binmode DATA, ':encoding(ISO-8859-1)';
          binmode STDOUT, ':encoding(Windows-1252)';

          while (my $original_name = <DATA>) {
              chomp $original_name;
              my $normalized_name = nc($original_name);
              printf "%30s %s\n", $original_name, $normalized_name;
          }

          __DATA__
          MARILYN MCCORD ADAMS
          D'ALEMBERT, JEAN
          ÉTIENNE DE LA BOÉTIE
          ÉMILIE DU CHÂTELET
          HÉLÈNE CIXOUS
          DESCARTES, RENÉ
          durkheim, émile
          FREUD, SIGMUND
          GÖDEL, KURT
          þorsteinn gylfason
          OLIVER WENDELL HOLMES, JR.
          JUNG, CARL
          KANT, IMMANUEL
          MACHIAVELLI, NICCOLÒ
          MARX, KARL
          NIETZSCHE, FRIEDRICH
          ROUSSEAU, JEAN-JACQUES
          SARTRE, JEAN-PAUL
          SCHOPENHAUER, ARTHUR
          ANNE LOUISE GERMAINE DE STAËL

          D:\>perl 889023.pl
                    MARILYN MCCORD ADAMS Marilyn McCord Adams
                        D'ALEMBERT, JEAN D'Alembert, Jean
                    ÉTIENNE DE LA BOÉTIE Étienne de la Boétie
                      ÉMILIE DU CHÂTELET Émilie du Châtelet
                           HÉLÈNE CIXOUS Hélène Cixous
                         DESCARTES, RENÉ Descartes, René
                         durkheim, émile Durkheim, Émile
                          FREUD, SIGMUND Freud, Sigmund
                             GÖDEL, KURT Gödel, Kurt
                      þorsteinn gylfason Þorsteinn Gylfason
              OLIVER WENDELL HOLMES, JR. Oliver Wendell Holmes, Jr.
                              JUNG, CARL Jung, Carl
                          KANT, IMMANUEL Kant, Immanuel
                    MACHIAVELLI, NICCOLÒ Machiavelli, Niccolò
                              MARX, KARL Marx, Karl
                    NIETZSCHE, FRIEDRICH Nietzsche, Friedrich
                  ROUSSEAU, JEAN-JACQUES Rousseau, Jean-Jacques
                       SARTRE, JEAN-PAUL Sartre, Jean-Paul
                    SCHOPENHAUER, ARTHUR Schopenhauer, Arthur
           ANNE LOUISE GERMAINE DE STAËL Anne Louise Germaine de Staël

          D:\>

      When I remove the two calls to binmode, the script produces the same output. This is because Lingua::EN::NameCase has use locale in effect. So although wind wrote, "It's not going to help you with your special character issue," the module does take care of the accented characters for you, at least on a Microsoft Windows computer with the right code page and regional (i.e., locale) settings. Even so, it's better and safer to be explicit about the character encodings in your Perl script.

      The module converts MCCORD to McCord, but it cleverly does not convert MACHIAVELLI to MacHiavelli. Perché no? Because Machiavelli ends with an i, so it rightly surmises it's an Italian name. Nice.

      My favorite name in the list is Þorsteinn Gylfason, converted from all lowercase letters, þorsteinn gylfason. (See þorn.info.)

Re: Unable to lc upper case accented characters
by Jim (Curate) on Feb 19, 2011 at 03:40 UTC

    This script…

        #!perl
        use strict;
        use warnings;

        binmode DATA, ':encoding(ISO-8859-1)';
        binmode STDOUT, ':encoding(Windows-1252)';

        while (my $original_names = <DATA>) {
            chomp $original_names;
            my $normalized_names = normalize_names($original_names);
            print "$original_names => $normalized_names\n";
        }

        exit 0;

        sub normalize_names {
            return join '', map { ucfirst lc $_ } $_[0] =~ m/(\w+|\W+)/g;
        };

        __DATA__
        CLÉ USB
        CLÉMMY USB

    …produces this output…

        CLÉ USB => Clé Usb
        CLÉMMY USB => Clémmy Usb

    …at the Microsoft Windows command prompt (chcp 1252).