
I have names, all composed of ASCII characters, that I need to normalize:

remove non-letter characters (' , .)
replace "-" with a space
start each word with an upper-case letter
break camel-case words: dosSantos -> Dos Santos

I came up with the code below, which seems to work (as far as I can test it).

My questions: how could I improve it? (I think it will break with Unicode characters; what changes should I make to get it to work with any character set?)

Thanks

François

use strict;
use warnings;

while ( my $t = <DATA> ) {
    chomp $t;
    printf "orig: %-30s translated: %s\n", $t, translate($t);

}

sub translate {
    my $str = shift;
    $str =~ tr/-/ /;           #replace - with a space
    $str =~ tr/a-zA-Z/ /cs;    #replace non letter with a space
    my @words = split( /\s+/, $str );
    foreach my $w (@words) {

        #insert a space where an upper-case letter follows a lower-case one,
        #and capitalize each piece so that dosSantos becomes "Dos Santos"
        if ( $w =~ /\p{isLower}\p{isUpper}/ ) {
            my @all;
            while ( $w =~ m/\G(\p{isUpper}*\p{isLower}+)/g ) {
                push @all, ucfirst($1);
            }
            $w = join( " ", @all );
        }
        else {

            $w = ucfirst( lc($w) );    # relies on the foreach alias: this changes @words in place
        }
    }
    return join( ' ', @words );
}
__DATA__
Acierno James S., Jr.
Acierno James, Jr.
Ackermann-Hirschi L.
Agatonovic-Jovini T.
Alba-Castro Jose-Luis
Alconada Verzini M. J.
AlconadaVerzini M. J.
Alvarez Fernandez A.
Alvarez-Bolado Gonzalo
Alvarez-Gonzalez B.
AlvarezGonzalez B.
AlvarezPiqueras D
Amor Dos Santos S. P.
Amor DosSantos S. P.
AmorDosSantos S. P
da Costa F. Barreiro Guimaraes
Dano Hoffmann M.
DanoHoffmann M.
Dell' Acqua A.
Dell' Asta L.
Dell'Acqua A.
Dell'Asta L.
Dell'Omo Giacomo
della Volp D.
della Volpe D.
Della Volpe D.
DeRegie J. B. De Vivie
Derendarz D.
deRenstrom P. A. Bruckman
Dupl'akova Nikoleta
Duplakova Nikoleta
Faucci Giannelli M.
Fauccigiannelli M.
FaucciGiannelli M.
Yusuff I.
Yusuff' I.
Yao W-M
Yao W-M.
Yao W. -M
Yao W. -M.

In reply to regex: help for improvement by frazap
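
On the Unicode part of the question: one possible direction, offered as a minimal sketch rather than a tested replacement, is to swap the ASCII-only tr/a-zA-Z/ /cs for Unicode properties (\p{L} for letters, \p{M} for combining marks, \p{Ll} and \p{Lu} for lower and upper case). The sub name normalize_name and the "Müller" sample are made up for this illustration and are not from the original post. The sketch assumes the input has already been decoded into character strings (for example by reading it through an :encoding(UTF-8) layer), and it lower-cases each piece after splitting, so an all-caps word becomes capitalized.

```perl
use strict;
use warnings;
use feature 'unicode_strings';    # Unicode rules for lc/ucfirst (needs Perl 5.12+)

binmode STDOUT, ':encoding(UTF-8)';

# normalize_name is a hypothetical name chosen for this sketch.
sub normalize_name {
    my ($str) = @_;

    # Every run of characters that are neither letters nor combining marks
    # (quotes, dots, commas, "-", spaces, digits) becomes a single space.
    $str =~ s/[^\p{L}\p{M}]+/ /g;

    # Break camel case: insert a space between a lower-case letter
    # and the upper-case letter that follows it.
    $str =~ s/(?<=\p{Ll})(?=\p{Lu})/ /g;

    # split ' ' skips leading whitespace and drops trailing empty fields,
    # so no separate trimming step is needed.
    return join ' ', map { ucfirst( lc $_ ) } split ' ', $str;
}

# The first three names come from the __DATA__ block above;
# the last one is an invented non-ASCII sample (\x{FC} is u-umlaut).
for my $name (
    'AmorDosSantos S. P',
    'deRenstrom P. A. Bruckman',
    'Yao W. -M.',
    "m\x{FC}ller-l\x{FC}denscheidt",
  )
{
    printf "orig: %-30s translated: %s\n", $name, normalize_name($name);
}
```

With this approach the separate tr/-/ / step is no longer needed, because "-" is already covered by the non-letter substitution, and the camel-case split becomes one substitution instead of a \G loop. For example, "AmorDosSantos S. P" should come out as "Amor Dos Santos S P".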
