regex: help for improvement

frazap has asked for the wisdom of the Perl Monks concerning the following question:

I have names all composed of ascii characters, that I need to uniformize:

remove non letter (',.)
replace "-" by a space
start each word with an upper case
break camel case word: dosSantos -> Dos Santos

I came with the code below that seems to work (as far as I can test it).

My questions: how could I improved it ? (I think it will break if with unicode characters, what changes should I made to get it work with any character set ?)

Thanks

François

use strict;
use warnings;

while ( my $t = <DATA> ) {
    chomp $t;
    printf "orig: %-30s translated: %s\n", $t, translate($t);

}

sub translate {
    my $str = shift;
    $str =~ tr/-/ /;           #replace - with a space
    $str =~ tr/a-zA-Z/ /cs;    #replace non letter with a space
    my @words = split( /\s+/, $str );
    foreach my $w (@words) {

        #insert a space when a upper case is inside a word
        if ( $w =~ /\p{isLower}\p{isUpper}/ ) {
            my @all;
            while ( $w =~ m/\G(\p{isUpper}*\p{isLower}+)/g ) {
                push @all, $1;
            }
            $w = join( " ", @all );
        }
        else {

            $w = ucfirst( lc($w) ); # we are using side effect of fore
+ach loop
        }
    }
    return join( ' ', @words );
}
__DATA__
Acierno James S., Jr.
Acierno James, Jr.
Ackermann-Hirschi L.
Agatonovic-Jovini T.
Alba-Castro Jose-Luis
Alconada Verzini M. J.
AlconadaVerzini M. J.
Alvarez Fernandez A.
Alvarez-Bolado Gonzalo
Alvarez-Gonzalez B.
AlvarezGonzalez B.
AlvarezPiqueras D
Amor Dos Santos S. P.
Amor DosSantos S. P.
AmorDosSantos S. P
da Costa F. Barreiro Guimaraes
Dano Hoffmann M.
DanoHoffmann M.
Dell' Acqua A.
Dell' Asta L.
Dell'Acqua A.
Dell'Asta L.
Dell'Omo Giacomo
della Volp D.
della Volpe D.
Della Volpe D.
DeRegie J. B. De Vivie
Derendarz D.
deRenstrom P. A. Bruckman
Dupl'akova Nikoleta
Duplakova Nikoleta
Faucci Giannelli M.
Fauccigiannelli M.
FaucciGiannelli M.
Yusuff I.
Yusuff' I.
Yao W-M
Yao W-M.
Yao W. -M
Yao W. -M.
[download]

Comment on regex: help for improvement Download Code

Replies are listed 'Best First'.
Re: regex: help for improvement by hippo (Archbishop) on Dec 14, 2018 at 09:14 UTC
how could I improved it ? The best thing you could do in general terms is to put the subroutine into a module and then write a test script (eg. using Test::More) which would compare your test set of input data with your required set of output data. This allows you to add extra functionality later while catching regressions. Here are a couple of specific suggestions, though. When I saw this: `$str =~ tr/-/ /; #replace - with a space $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space` [download] I took some time to wonder why the first statement was there when the second statement seemed to render it obsolete. Why not remove the top one? Also, I think the if-block for breaking camelCase could be greatly simplified. eg: `$w =~ s/(\p{isLower})(\p{isUpper})/$1 $2/g or $w = ucfirst( lc($w) );` [download] This shortens the code to: #!/usr/bin/perl use strict; use warnings; while ( my $t = <DATA> ) { chomp $t; printf "orig: %-30s translated: %s\n", $t, translate($t); } sub translate { my $str = shift; $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space my @words = split( /\s+/, $str ); foreach my $w (@words) { #insert a space when a upper case is inside a word $w =~ s/(\p{isLower})(\p{isUpper})/$1 $2/g or $w = ucfirst( lc($w) ); # we are using side effect of fore +ach loop } return join( ' ', @words ); } __DATA__ Acierno James S., Jr. Acierno James, Jr. Ackermann-Hirschi L. Agatonovic-Jovini T. Alba-Castro Jose-Luis Alconada Verzini M. J. AlconadaVerzini M. J. Alvarez Fernandez A. Alvarez-Bolado Gonzalo Alvarez-Gonzalez B. AlvarezGonzalez B. AlvarezPiqueras D Amor Dos Santos S. P. Amor DosSantos S. P. AmorDosSantos S. P da Costa F. Barreiro Guimaraes Dano Hoffmann M. DanoHoffmann M. Dell' Acqua A. Dell' Asta L. Dell'Acqua A. Dell'Asta L. Dell'Omo Giacomo della Volp D. della Volpe D. Della Volpe D. DeRegie J. B. De Vivie Derendarz D. deRenstrom P. A. Bruckman Dupl'akova Nikoleta Duplakova Nikoleta Faucci Giannelli M. Fauccigiannelli M. FaucciGiannelli M. Yusuff I. Yusuff' I. Yao W-M Yao W-M. Yao W. -M Yao W. -M. [download] HTH. (Edited to fix the Test::More link - thanks Laurent_R and kcott for pointing this out)	[reply] [d/l] [select]
Re: regex: help for improvement by choroba (Cardinal) on Dec 14, 2018 at 09:22 UTC
Why is `della Vople D.` translated to `Della Volpe D`, but `deRenstrom P. A. Bruckman` becomes `de Renstrom P A Bruckman`? In other words, you don't capitalize the first word if it comes from a snakeCase word. I tried to solve this with only regexes: `#!/usr/bin/perl use warnings; use strict; use Test::More; while ( my $t = <DATA> ) { chomp $t; is new_translate($t), translate($t); } done_testing(); sub translate { # the OP's code here } sub new_translate { my ($str) = @_; $str =~ tr/-/ /; $str =~ tr/a-zA-Z/ /cs; $str =~ s/(?<=\p{isLower})(?=\p{isUpper})/ /g; $str =~ s/(?:(?<=\s)\|(?<=^))(\p{isLower})/\u$1/g; $str =~ s/\s+$//r }` [download] ($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a, [download]	[reply] [d/l] [select]
Re^2: regex: help for improvement by frazap (Monk) on Dec 14, 2018 at 14:11 UTC
I had not seen the de Renstroem ... So the function is even not working as I was expecting. To treat names with accented letter (say utf8) using your function, I tried this `my ($str) = @_; $str =~ tr/-/ /; #$str =~ tr/a-zA-Z/ /cs; my $new; while ( $str =~ m/\G([\p{isUpper}\|\p{isLower}\|\s]+)/g ) { $new.=$1; } $str = $new; $str =~ s/(?<=\p{isLower})(?=\p{isUpper})/ /g; $str =~ s/(?:(?<=\s)\|(?<=^))(\p{isLower})/\u$1/g; $str =~ s/\s+$//r` [download] but it's not working since any character that is not a letter or a space break the loop and the rest is lost. How can I adapt tr/a-zA-Z/ /cs for unicode character ? Thanks F.	[reply] [d/l]
Re^3: regex: help for improvement by choroba (Cardinal) on Dec 14, 2018 at 14:35 UTC
Do you want to replace any sequence of non-letters by a space? `$str =~ s/\P{isLetter}+/ /g;` [download] ($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a, [download]	[reply] [d/l] [select]
Re^4: regex: help for improvement by frazap (Monk) on Dec 14, 2018 at 15:25 UTC
Re^3: regex: help for improvement by AnomalousMonk (Archbishop) on Dec 14, 2018 at 19:21 UTC
I had not seen the de Renstroem ... So the function is even not working as I was expecting. frazap: WRT the use of Test::More for posing questions (to PerlMonks or to yourself during development!), see also neilwatson's article How to ask better questions using Test::More and sample data. Of course, the essential question you're asking yourself during development is "Does this code work the way I think it works?" Give a man a fish: `<%-{-{-{-<`	[reply] [d/l]
Re^3: regex: help for improvement by Laurent_R (Canon) on Dec 14, 2018 at 18:16 UTC
I had not seen the de Renstroem ... So the function is even not working as I was expecting. That's where using a testing module such as Test::More, as suggested earlier by hippo, is really useful and handy.	[reply]
Re: regex: help for improvement by Eily (Monsignor) on Dec 14, 2018 at 09:37 UTC
All the spaces that you want to insert can be added in one go, with the logic: replace all non letter chars by a space (I my version, I replace a group of non letter chars by a single space, that way "W-M" and "W. -M" both come "W M") replace the position between a letter an an upper cas letter by a space (if you have two upper case in a row, you still want a space right? eg "DAcosta" => "D Acosta") `while (my $name = <DATA>) { chomp($name); $name =~ s/ [^a-zA-Z]+ # Non letter chars \| # (?<= [a-zA-Z] ) # Something that comes after a let +ter (?= [A-Z] ) # and comes before an uppercase le +tter / /xg; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; say $name; }` [download] The `(?<= )` `(?= )` are Lookaround Assertions that do exactly what it says on the tin, and check around the current position, without including the checked value in the match (so the matched letters on both side aren't removed). Edit: you can chain s/// operations if you return the result with /r, but it's not very elegant: `say s/[^a-zA-Z]+\|(?<=[a-zA-Z])(?=[A-Z])/ /gr =~ s/(\w+)/\u\L$1/gr for +<DATA>;` [download]	[reply] [d/l] [select]
Re^2: regex: help for improvement by AnomalousMonk (Archbishop) on Dec 14, 2018 at 19:40 UTC
... if you have two upper case in a row, you still want a space right? frazap: Another question that finds its ideal home in a Test::More (or similar) test set. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l]
Re^2: regex: help for improvement by frazap (Monk) on Dec 14, 2018 at 14:33 UTC
Thanks a lot ... I tried to adapt your proposal for non ascii characters: `sub translate_Eily { my $name = shift; $name =~ s/ \P{isLetter}+ # Non letter chars \| # or (?<= \p{isLetter} ) # Something that comes after a + letter (?= \p{isUpper} ) # and comes before an upperc +ase letter / /xg; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; return $name; }` [download] If I take care of utf8 in my script and in the I/O with `use utf8; use open OUT => ':encoding(utf8)'; use open ':std';` [download] It works. François	[reply] [d/l] [select]
Re^3: regex: help for improvement by Laurent_R (Canon) on Dec 14, 2018 at 18:23 UTC
You probably don't need to `use utf8;` here. Please note that this pragma is used to enable/disable UTF-8 in source code and has no impact on the external data you're processing.	[reply] [d/l]
Re^4: regex: help for improvement by frazap (Monk) on Dec 20, 2018 at 14:05 UTC
Re^5: regex: help for improvement by 1nickt (Canon) on Dec 20, 2018 at 14:31 UTC