timgreenwood has asked for the wisdom of the Perl Monks concerning the following question:

I want to use simple Unicode case mapping and not the default full case mapping. The entry in http://perldoc.perl.org/perlunicode.html#User-Defined-Case-Mappings looked to be the way to go.

However a small test program
#!/usr/bin/perl
use strict;

sub ToUpper {
    return<<END;
0061\t0063\t0044
END
    print "Here we are\n";
}

my $tim = "abcdef";
my $t2 = uc($tim);
print "$t2\n";

does not give the expected result, just the normal uc output. The print statement is not executed so ToUpper is not even called.

What am I doing wrong? If anyone already has simple case mapping implemented, that would be nice too.

Replies are listed 'Best First'.
Re: User-Defined Case Mappings
by Tanktalus (Canon) on Feb 23, 2009 at 23:57 UTC

    I'm curious as to how this handles multiple sections of the UTF space simultaneously, but nevermind that ;-)

    First off, your print statement is after your return statement. It'll never execute under any circumstances.

    Second, your $tim string isn't a UTF-8 (character) string internally, so the mapping is moot. Try decoding it from utf8 using the Encode module.

    This is what I got to work:

    #!/usr/bin/perl
    use strict;
    use Encode;

    sub ToUpper {
        print "Here we are\n";
        return<<END;
    0061\t0063\t0041
    END
    }

    #my $tim = "abcdef";
    my $tim = Encode::decode('utf8',"abcdef");
    my $t2 = uc($tim);
    print "[$t2]\n";
    Good luck,

      I'm curious as to how this handles multiple sections of the UTF space simultaneously, but nevermind that ;-)

      Well, it doesn't, of course. The working code that you posted essentially disables all lower-to-upper case conversions except for the first three ASCII lower-case letters. Here's a version that handles a couple of different ranges (warning to potential users: STDOUT includes utf8 wide characters):

      #!/usr/bin/perl
      use strict;
      use warnings;

      binmode STDOUT,":utf8";
      my $tim = "abcdef \x{ff41}\x{ff42}\x{ff43}\x{ff44}\x{ff45}\x{ff46}";
      print "main::uc( $tim ) => ", uc($tim), "\n";

      sub ToUpper {
          return <<END;
      0061\t0063\t0041
      ff41\tff43\tff21
      END
      }
      But the description of "user-defined case mappings" in the perlunicode man page seems to be lacking something, IMO -- to wit: why would anyone want this? It does not seem to provide the same sort of usefulness that you get with user-defined character classes (described in the previous section of the man page).

      I tried to see if I could make different packages with different case mappings, and it didn't work as hoped for -- in fact, it appears that the first package to define the "ToUpper" and other case-relation functions will set the case relations immutably for the rest of the script.

      Here's a test, which I tried two different ways, once calling the two package subs in the order shown, then in the other order. The second sub call always gives the same result as the first call (i.e. both calls always use the mapping created by the first call):

      I have to admit, I don't see the point of this feature, except to make up some really wicked obfu.

      (updated to add readmore tags)

      Thank you Tanktalus - my first visit to the monastery has been very successful. I should have realized the problem having used the Encode module before. I do still see one (easily avoidable) issue in that contrary to the description in http://perldoc.perl.org/Encode.html decode_utf8(string) is not working as a synonym (in this case) for decode('utf8',string). I am using perl, v5.8.5 built for x86_64-linux-thread-multi - could this be an implementation problem? This is shown in the snippet below.
      #!/usr/bin/perl
      use strict;
      use Encode;

      sub ToUpper {
          return<<END;
      0061\t0063\t0041
      END
      }

      # Below fails
      my $tim = decode_utf8("abcdef");
      # But this one works
      #my $tim = decode('utf8',"abcdef");
      print uc($tim),"\n";
      I will respond to the other questions separately.
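For anyone hitting the same difference, here is a minimal diagnostic sketch. It assumes (this is not confirmed anywhere in the thread) that the two calls differ in whether the returned string carries Perl's internal UTF-8 flag, which utf8::is_utf8 reports. flag_state is a hypothetical helper name, and utf8::upgrade is shown only as one possible way to force the flag on without going through Encode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode decode_utf8);

# Hypothetical helper: report whether a string carries the internal UTF-8 flag.
sub flag_state {
    my ($s) = @_;
    return utf8::is_utf8($s) ? "on" : "off";
}

my $plain  = "abcdef";
my $octets = "abcdef";

# Decode the same ASCII-only octets both ways and compare the flag.
my $via_decode      = decode('utf8', $octets);
my $via_decode_utf8 = decode_utf8($octets);

print "plain literal:       ", flag_state($plain),           "\n";
print "decode('utf8', ...): ", flag_state($via_decode),      "\n";
print "decode_utf8(...):    ", flag_state($via_decode_utf8), "\n";

# Possible workaround: utf8::upgrade sets the flag in place
# without changing the characters in the string.
my $forced = "abcdef";
utf8::upgrade($forced);
print "after utf8::upgrade: ", flag_state($forced), "\n";
```

If decode_utf8 reports "off" on your perl while decode('utf8', ...) reports "on", that would account for the behaviour described above.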
Re: User-Defined Case Mappings
by graff (Chancellor) on Feb 24, 2009 at 01:38 UTC
    Now that Tanktalus has answered your questions, I'd be curious to understand what you mean when you say:

    I want to use simple Unicode case mapping and not the default full case mapping.

    When I use Unicode strings in perl, I consider "simple Unicode case mapping" to be the same thing as "default full case mapping". But you seem to mean something else by it, so I'm confused.

    I'm also curious whether you really intended to use "0044" as the third element being returned by your "ToUpper" sub. Did you really want uc("abcdef") to return "DEFdef"?

      I'm also curious whether you really intended to use "0044" as the third element being returned by your "ToUpper" sub. Did you really want uc("abcdef") to return "DEFdef"?

      My code snippet was just to understand how to make the user mappings work. The snippet itself is not of practical use. The choice of "0044" as the third element was pretty arbitrary and just to make the difference from the default mapping stand out.

      When I use Unicode strings in perl, I consider "simple Unicode case mapping" to be the same thing as "default full case mapping". But you seem to mean something else by it, so I'm confused.

      Unicode case mapping is described in Unicode Standard Annex #21. Briefly, a full mapping may expand one character to multiple characters. For example, the German character U+00DF "ß" (small letter sharp s) expands when uppercased to the two-character sequence "SS". Casing may also be context dependent. Simple case mapping is defined in the UnicodeData.txt file from the Unicode Character Database. It gives a strictly one-to-one case mapping that is not context dependent.
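The distinction can be sketched without user-defined mappings at all. This is a rough sketch, not something posted in the thread: it assumes the core Unicode::UCD module is available and uses its charinfo() function, whose "upper" field holds the simple uppercase mapping from UnicodeData.txt (empty when there is none). simple_uc and _simple_upper are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

my %upper_cache;    # code point => simple uppercase character

# Hypothetical helper: look up the simple (one-to-one) uppercase mapping.
sub _simple_upper {
    my ($cp) = @_;
    my $info = charinfo($cp);
    # Unassigned code point or no simple mapping: keep the character as is.
    return chr($cp) unless $info && length $info->{upper};
    return chr(hex $info->{upper});
}

# Hypothetical helper: uppercase a string one character at a time,
# so the result always has the same length as the input.
sub simple_uc {
    my ($str) = @_;
    return join '', map {
        my $cp = ord($_);
        $upper_cache{$cp} = _simple_upper($cp) unless exists $upper_cache{$cp};
        $upper_cache{$cp};
    } split //, $str;
}

binmode STDOUT, ':encoding(UTF-8)';

my $word = "stra\x{df}e";
utf8::upgrade($word);    # make sure uc() applies Unicode rules

my $full   = uc($word);
my $simple = simple_uc($word);

print "full:   $full\n";      # full:   STRASSE
print "simple: $simple\n";    # simple: STRAßE
```

charinfo() is not fast, which is why the sketch caches each code point's result; for large inputs a precomputed table would probably be a better fit.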

Re: User-Defined Case Mappings
by kennethk (Abbot) on Feb 23, 2009 at 22:57 UTC

    The mistake you are making is never calling your subroutine. The sub ToUpper{} syntax just generates the code; to execute it you need to invoke it, with perhaps ToUpper();. A read through perlsub may be illuminating.

    A great deal of effort has been put into dealing with complex character sets over the years. There's a decent intro to what you need to do to keep from shooting yourself in the foot in perlunitut.

    Never mind; what Tanktalus said. The OP's code is taken from the User-Defined Case Mappings section of perlunicode.