frazap has asked for the wisdom of the Perl Monks concerning the following question:

I have names all composed of ascii characters, that I need to uniformize:

I came with the code below that seems to work (as far as I can test it).

My questions: how could I improved it ? (I think it will break if with unicode characters, what changes should I made to get it work with any character set ?)

Thanks

François

use strict; use warnings; while ( my $t = <DATA> ) { chomp $t; printf "orig: %-30s translated: %s\n", $t, translate($t); } sub translate { my $str = shift; $str =~ tr/-/ /; #replace - with a space $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space my @words = split( /\s+/, $str ); foreach my $w (@words) { #insert a space when a upper case is inside a word if ( $w =~ /\p{isLower}\p{isUpper}/ ) { my @all; while ( $w =~ m/\G(\p{isUpper}*\p{isLower}+)/g ) { push @all, $1; } $w = join( " ", @all ); } else { $w = ucfirst( lc($w) ); # we are using side effect of fore +ach loop } } return join( ' ', @words ); } __DATA__ Acierno James S., Jr. Acierno James, Jr. Ackermann-Hirschi L. Agatonovic-Jovini T. Alba-Castro Jose-Luis Alconada Verzini M. J. AlconadaVerzini M. J. Alvarez Fernandez A. Alvarez-Bolado Gonzalo Alvarez-Gonzalez B. AlvarezGonzalez B. AlvarezPiqueras D Amor Dos Santos S. P. Amor DosSantos S. P. AmorDosSantos S. P da Costa F. Barreiro Guimaraes Dano Hoffmann M. DanoHoffmann M. Dell' Acqua A. Dell' Asta L. Dell'Acqua A. Dell'Asta L. Dell'Omo Giacomo della Volp D. della Volpe D. Della Volpe D. DeRegie J. B. De Vivie Derendarz D. deRenstrom P. A. Bruckman Dupl'akova Nikoleta Duplakova Nikoleta Faucci Giannelli M. Fauccigiannelli M. FaucciGiannelli M. Yusuff I. Yusuff' I. Yao W-M Yao W-M. Yao W. -M Yao W. -M.

Replies are listed 'Best First'.
Re: regex: help for improvement
by hippo (Archbishop) on Dec 14, 2018 at 09:14 UTC
    how could I improved it ?

    The best thing you could do in general terms is to put the subroutine into a module and then write a test script (eg. using Test::More) which would compare your test set of input data with your required set of output data. This allows you to add extra functionality later while catching regressions.

    Here are a couple of specific suggestions, though. When I saw this:

    $str =~ tr/-/ /; #replace - with a space $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space

    I took some time to wonder why the first statement was there when the second statement seemed to render it obsolete. Why not remove the top one?

    Also, I think the if-block for breaking camelCase could be greatly simplified. eg:

    $w =~ s/(\p{isLower})(\p{isUpper})/$1 $2/g or $w = ucfirst( lc($w) );

    This shortens the code to:

    #!/usr/bin/perl use strict; use warnings; while ( my $t = <DATA> ) { chomp $t; printf "orig: %-30s translated: %s\n", $t, translate($t); } sub translate { my $str = shift; $str =~ tr/a-zA-Z/ /cs; #replace non letter with a space my @words = split( /\s+/, $str ); foreach my $w (@words) { #insert a space when a upper case is inside a word $w =~ s/(\p{isLower})(\p{isUpper})/$1 $2/g or $w = ucfirst( lc($w) ); # we are using side effect of fore +ach loop } return join( ' ', @words ); } __DATA__ Acierno James S., Jr. Acierno James, Jr. Ackermann-Hirschi L. Agatonovic-Jovini T. Alba-Castro Jose-Luis Alconada Verzini M. J. AlconadaVerzini M. J. Alvarez Fernandez A. Alvarez-Bolado Gonzalo Alvarez-Gonzalez B. AlvarezGonzalez B. AlvarezPiqueras D Amor Dos Santos S. P. Amor DosSantos S. P. AmorDosSantos S. P da Costa F. Barreiro Guimaraes Dano Hoffmann M. DanoHoffmann M. Dell' Acqua A. Dell' Asta L. Dell'Acqua A. Dell'Asta L. Dell'Omo Giacomo della Volp D. della Volpe D. Della Volpe D. DeRegie J. B. De Vivie Derendarz D. deRenstrom P. A. Bruckman Dupl'akova Nikoleta Duplakova Nikoleta Faucci Giannelli M. Fauccigiannelli M. FaucciGiannelli M. Yusuff I. Yusuff' I. Yao W-M Yao W-M. Yao W. -M Yao W. -M.

    HTH.

    (Edited to fix the Test::More link - thanks Laurent_R and kcott for pointing this out)

Re: regex: help for improvement
by choroba (Cardinal) on Dec 14, 2018 at 09:22 UTC
    Why is della Vople D. translated to Della Volpe D, but deRenstrom P. A. Bruckman becomes de Renstrom P A Bruckman? In other words, you don't capitalize the first word if it comes from a snakeCase word.

    I tried to solve this with only regexes:

    #!/usr/bin/perl use warnings; use strict; use Test::More; while ( my $t = <DATA> ) { chomp $t; is new_translate($t), translate($t); } done_testing(); sub translate { # the OP's code here } sub new_translate { my ($str) = @_; $str =~ tr/-/ /; $str =~ tr/a-zA-Z/ /cs; $str =~ s/(?<=\p{isLower})(?=\p{isUpper})/ /g; $str =~ s/(?:(?<=\s)|(?<=^))(\p{isLower})/\u$1/g; $str =~ s/\s+$//r }

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      I had not seen the de Renstroem ... So the function is even not working as I was expecting.

      To treat names with accented letter (say utf8) using your function, I tried this

      my ($str) = @_; $str =~ tr/-/ /; #$str =~ tr/a-zA-Z/ /cs; my $new; while ( $str =~ m/\G([\p{isUpper}|\p{isLower}|\s]+)/g ) { $new.=$1; } $str = $new; $str =~ s/(?<=\p{isLower})(?=\p{isUpper})/ /g; $str =~ s/(?:(?<=\s)|(?<=^))(\p{isLower})/\u$1/g; $str =~ s/\s+$//r

      but it's not working since any character that is not a letter or a space break the loop and the rest is lost. How can I adapt tr/a-zA-Z/ /cs for unicode character ?

      Thanks

      F.
        Do you want to replace any sequence of non-letters by a space?
        $str =~ s/\P{isLetter}+/ /g;

        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        I had not seen the de Renstroem ... So the function is even not working as I was expecting.

        frazap:   WRT the use of Test::More for posing questions (to PerlMonks or to yourself during development!), see also neilwatson's article How to ask better questions using Test::More and sample data. Of course, the essential question you're asking yourself during development is "Does this code work the way I think it works?"


        Give a man a fish:  <%-{-{-{-<

        I had not seen the de Renstroem ... So the function is even not working as I was expecting.
        That's where using a testing module such as Test::More, as suggested earlier by hippo, is really useful and handy.
Re: regex: help for improvement
by Eily (Monsignor) on Dec 14, 2018 at 09:37 UTC

    All the spaces that you want to insert can be added in one go, with the logic:

    • replace all non letter chars by a space (I my version, I replace a group of non letter chars by a single space, that way "W-M" and "W. -M" both come "W M")
    • replace the position between a letter an an upper cas letter by a space (if you have two upper case in a row, you still want a space right? eg "DAcosta" => "D Acosta")
    while (my $name = <DATA>) { chomp($name); $name =~ s/ [^a-zA-Z]+ # Non letter chars | # (?<= [a-zA-Z] ) # Something that comes after a let +ter (?= [A-Z] ) # and comes before an uppercase le +tter / /xg; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; say $name; }
    The (?<= ) (?= ) are Lookaround Assertions that do exactly what it says on the tin, and check around the current position, without including the checked value in the match (so the matched letters on both side aren't removed).

    Edit: you can chain s/// operations if you return the result with /r, but it's not very elegant:

    say s/[^a-zA-Z]+|(?<=[a-zA-Z])(?=[A-Z])/ /gr =~ s/(\w+)/\u\L$1/gr for +<DATA>;

      ... if you have two upper case in a row, you still want a space right?

      frazap:   Another question that finds its ideal home in a Test::More (or similar) test set.


      Give a man a fish:  <%-{-{-{-<

      Thanks a lot ...

      I tried to adapt your proposal for non ascii characters:
      sub translate_Eily { my $name = shift; $name =~ s/ \P{isLetter}+ # Non letter chars | # or (?<= \p{isLetter} ) # Something that comes after a + letter (?= \p{isUpper} ) # and comes before an upperc +ase letter / /xg; # \u is short for ucfirst and \L for lc $name =~ s/(\w+)/\u\L$1/g; return $name; }
      If I take care of utf8 in my script and in the I/O with
      use utf8; use open OUT => ':encoding(utf8)'; use open ':std';

      It works.

      François

        You probably don't need to use utf8; here. Please note that this pragma is used to enable/disable UTF-8 in source code and has no impact on the external data you're processing.