in reply to regex: help for improvement

All the spaces that you want to insert can be added in one go, with the logic:

while (my $name = <DATA>) { chomp($name); $name =~ s/ [^a-zA-Z]+ # Non letter chars | # (?<= [a-zA-Z] ) # Something that comes after a let +ter (?= [A-Z] ) # and comes before an uppercase le +tter / /xg; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; say $name; }
The (?<= ) (?= ) are Lookaround Assertions that do exactly what it says on the tin, and check around the current position, without including the checked value in the match (so the matched letters on both side aren't removed).

Edit: you can chain s/// operations if you return the result with /r, but it's not very elegant:

say s/[^a-zA-Z]+|(?<=[a-zA-Z])(?=[A-Z])/ /gr =~ s/(\w+)/\u\L$1/gr for +<DATA>;

Replies are listed 'Best First'.
Re^2: regex: help for improvement
by AnomalousMonk (Archbishop) on Dec 14, 2018 at 19:40 UTC
    ... if you have two upper case in a row, you still want a space right?

    frazap:   Another question that finds its ideal home in a Test::More (or similar) test set.


    Give a man a fish:  <%-{-{-{-<

Re^2: regex: help for improvement
by frazap (Monk) on Dec 14, 2018 at 14:33 UTC

    Thanks a lot ...

    I tried to adapt your proposal for non ascii characters:
    sub translate_Eily { my $name = shift; $name =~ s/ \P{isLetter}+ # Non letter chars | # or (?<= \p{isLetter} ) # Something that comes after a + letter (?= \p{isUpper} ) # and comes before an upperc +ase letter / /xg; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; return $name; }
    If I take care of utf8 in my script and in the I/O with
    use utf8; use open OUT => ':encoding(utf8)'; use open ':std';

    It works.

    François

      You probably don't need to use utf8; here. Please note that this pragma is used to enable/disable UTF-8 in source code and has no impact on the external data you're processing.

        For completeness, here is the answers above with the use of Test::More

        use utf8 is needed because of the accented characters in the text in the bottom part

        use strict; use warnings; use utf8; use Test::More; use open OUT => ':encoding(utf8)'; use open ':std'; my $pos=1; while ( my $t = <DATA> ) { chomp $t; #printf "orig: %-30s translated: %s\n", $t, translate_yeli2($t); my @data = split/\s\*\*\s/, $t; my $res = translate_eily2($data[0]); # die $data[1]; ok ($res eq $data[1], "*" . $data[0]. "* -> *" . $res . "* / *". $d +ata[1]. "* " . $pos++); } done_testing; sub translate { my $str = shift; $str =~ tr/-/ /; #replace - with a space $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space my @words = split( /\s+/, $str ); foreach my $w (@words) { #insert a space when a upper case is inside a word if ( $w =~ /\p{isLower}\p{isUpper}/ ) { my @all; while ( $w =~ m/\G(\p{isUpper}*\p{isLower}+)/g ) { push @all, $1; } $w = join( " ", @all ); } else { $w = ucfirst( lc($w) ); # we are using side effect of fore +ach loop } } return join( ' ', @words ); } sub translate_eily { my $name = shift; $name =~ s/ [^a-zA-Z]+ # Non letter chars | # (?<= [a-zA-Z] ) # Something that comes after a let +ter (?= [A-Z] ) # and comes before an uppercase le +tter / /xg; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; $name =~s/\s+$//g; return $name; } sub translate_hippo { my $str = shift; $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space my @words = split( /\s+/, $str ); foreach my $w (@words) { #insert a space when a upper case is inside a word $w =~ s/(\p{isLower})(\p{isUpper})/$1 $2/g or $w = ucfirst( lc($w) ); } return join( ' ', @words ); } sub translate_choroba { my ($str) = @_; $str =~ tr/-/ /; #$str =~ tr/a-zA-Z/ /cs; $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space $str =~ s/(?<=\p{isLower})(?=\p{isUpper})/ /g; $str =~ s/(?:(?<=\s)|(?<=^))(\p{isLower})/\u$1/g; $str =~ s/\s+$//r; } sub translate_eily2 { my $name = shift; $name =~ s/ \P{isLetter}+ # Non letter chars | # or (?<= \p{isLetter} ) # Something that comes after a + letter (?= \p{isUpper} ) # and comes before an upperc +ase letter / /xg; # print "after s: *$name*\n"; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; $name =~s/\s+$//g; return $name; } __DATA__ Acierno James S., Jr. ** Acierno James S Jr Ackermann-Hirschi L. ** Ackermann Hirschi L Alba-Castro Jose-Luis ** Alba Castro Jose Luis Boulangère Françoise ** Boulangère Françoise AlconadaVerzini M. J. ** Alconada Verzini M J AmorDosSantos S. P ** Amor Dos Santos S P da Costa F. Barreiro Guimaraes ** Da Costa F Barreiro Guimaraes deRenstrom P. A. Bruckman ** De Renstrom P A Bruckman Fauccigiannelli M. ** Fauccigiannelli M FaucciGiannelli M. ** Faucci Giannelli M Yao W-M ** Yao W M Yao W-M. ** Yao W M Yao W. -M ** Yao W M Yao W. -M. ** Yao W M