Re: regex: help for improvement

All the spaces that you want to insert can be added in one go, with the logic:

replace all non letter chars by a space (I my version, I replace a group of non letter chars by a single space, that way "W-M" and "W. -M" both come "W M")
replace the position between a letter an an upper cas letter by a space (if you have two upper case in a row, you still want a space right? eg "DAcosta" => "D Acosta")

while (my $name = <DATA>)
{
  chomp($name);
  $name =~ s/
                [^a-zA-Z]+          # Non letter chars
              |                     #
                (?<= [a-zA-Z] )     # Something that comes after a let
+ter
                (?=  [A-Z] )        # and comes before an uppercase le
+tter
            / /xg;

  # \u is short for ucfirst and \L for lc
  $name =~ s/(\w+)/\u\L$1/g; 

  say $name;
}
[download]

The (?<= ) (?= ) are Lookaround Assertions that do exactly what it says on the tin, and check around the current position, without including the checked value in the match (so the matched letters on both side aren't removed).

Edit: you can chain s/// operations if you return the result with /r, but it's not very elegant:

say s/[^a-zA-Z]+|(?<=[a-zA-Z])(?=[A-Z])/ /gr =~ s/(\w+)/\u\L$1/gr for 
+<DATA>;
[download]

Comment on Re: regex: help for improvement Select or Download Code

Replies are listed 'Best First'.
Re^2: regex: help for improvement by AnomalousMonk (Archbishop) on Dec 14, 2018 at 19:40 UTC
... if you have two upper case in a row, you still want a space right? frazap: Another question that finds its ideal home in a Test::More (or similar) test set. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l]
Re^2: regex: help for improvement by frazap (Monk) on Dec 14, 2018 at 14:33 UTC
Thanks a lot ... I tried to adapt your proposal for non ascii characters: `sub translate_Eily { my $name = shift; $name =~ s/ \P{isLetter}+ # Non letter chars \| # or (?<= \p{isLetter} ) # Something that comes after a + letter (?= \p{isUpper} ) # and comes before an upperc +ase letter / /xg; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; return $name; }` [download] If I take care of utf8 in my script and in the I/O with `use utf8; use open OUT => ':encoding(utf8)'; use open ':std';` [download] It works. François	[reply] [d/l] [select]
Re^3: regex: help for improvement by Laurent_R (Canon) on Dec 14, 2018 at 18:23 UTC
You probably don't need to `use utf8;` here. Please note that this pragma is used to enable/disable UTF-8 in source code and has no impact on the external data you're processing.	[reply] [d/l]
Re^4: regex: help for improvement by frazap (Monk) on Dec 20, 2018 at 14:05 UTC
For completeness, here is the answers above with the use of Test::More `use utf8` is needed because of the accented characters in the text in the bottom part use strict; use warnings; use utf8; use Test::More; use open OUT => ':encoding(utf8)'; use open ':std'; my $pos=1; while ( my $t = <DATA> ) { chomp $t; #printf "orig: %-30s translated: %s\n", $t, translate_yeli2($t); my @data = split/\s\\\s/, $t; my $res = translate_eily2($data[0]); # die $data[1]; ok ($res eq $data[1], "" . $data[0]. " -> " . $res . " / ". $d +ata[1]. " " . $pos++); } done_testing; sub translate { my $str = shift; $str =~ tr/-/ /; #replace - with a space $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space my @words = split( /\s+/, $str ); foreach my $w (@words) { #insert a space when a upper case is inside a word if ( $w =~ /\p{isLower}\p{isUpper}/ ) { my @all; while ( $w =~ m/\G(\p{isUpper}\p{isLower}+)/g ) { push @all, $1; } $w = join( " ", @all ); } else { $w = ucfirst( lc($w) ); # we are using side effect of fore +ach loop } } return join( ' ', @words ); } sub translate_eily { my $name = shift; $name =~ s/ [^a-zA-Z]+ # Non letter chars \| # (?<= [a-zA-Z] ) # Something that comes after a let +ter (?= [A-Z] ) # and comes before an uppercase le +tter / /xg; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; $name =~s/\s+$//g; return $name; } sub translate_hippo { my $str = shift; $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space my @words = split( /\s+/, $str ); foreach my $w (@words) { #insert a space when a upper case is inside a word $w =~ s/(\p{isLower})(\p{isUpper})/$1 $2/g or $w = ucfirst( lc($w) ); } return join( ' ', @words ); } sub translate_choroba { my ($str) = @_; $str =~ tr/-/ /; #$str =~ tr/a-zA-Z/ /cs; $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space $str =~ s/(?<=\p{isLower})(?=\p{isUpper})/ /g; $str =~ s/(?:(?<=\s)\|(?<=^))(\p{isLower})/\u$1/g; $str =~ s/\s+$//r; } sub translate_eily2 { my $name = shift; $name =~ s/ \P{isLetter}+ # Non letter chars \| # or (?<= \p{isLetter} ) # Something that comes after a + letter (?= \p{isUpper} ) # and comes before an upperc +ase letter / /xg; # print "after s: $name\n"; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; $name =~s/\s+$//g; return $name; } __DATA__ Acierno James S., Jr. * Acierno James S Jr Ackermann-Hirschi L. Ackermann Hirschi L Alba-Castro Jose-Luis Alba Castro Jose Luis Boulangère Françoise Boulangère Françoise AlconadaVerzini M. J. Alconada Verzini M J AmorDosSantos S. P Amor Dos Santos S P da Costa F. Barreiro Guimaraes Da Costa F Barreiro Guimaraes deRenstrom P. A. Bruckman De Renstrom P A Bruckman Fauccigiannelli M. Fauccigiannelli M FaucciGiannelli M. Faucci Giannelli M Yao W-M Yao W M Yao W-M. Yao W M Yao W. -M Yao W M Yao W. -M. ** Yao W M [download]	[reply] [d/l] [select]
Re^5: regex: help for improvement by 1nickt (Canon) on Dec 20, 2018 at 14:31 UTC