in reply to Re: Memory Leak with XS but not pure C
in thread Memory Leak with XS but not pure C

Thanks Nerdvana for your fast and very helpful reply!

Using newSVpvn in place of newSVpv does indeed solve the complaint from Valgrind. You were of course also correct about the extraneous u8_strlen.

I'm the epitome of confusion regarding Perl and Unicode. Your tip to add "use utf8" was very useful, as I'm passing literal strings, and I added a check to the XS code before getting the string from the SV:

if (!SvUTF8(sv)) {
    sv = sv_mortalcopy(sv);
    sv_utf8_upgrade(sv);
}
s = SvPVutf8(sv, len);

As Marshall says below, the Eszett is a strange character (well, it is German) as it uppercases to 'SS'. This happens with many characters of other languages too. The standard Perl uc just leaves it there when uppercasing. The libunistring library (not mine!) does it correctly.

Thanks again for the help!

Re^3: Memory Leak with XS but not pure C
by hippo (Archbishop) on Mar 29, 2025 at 14:25 UTC
    The standard Perl uc just leaves it there when uppercasing.

    Any currently supported perl should uppercase it correctly:

    $ cat uct.pl 
    #!/usr/bin/env perl
    use strict;
    use warnings;
    use utf8;
    
    my $string = 'straße';
    my $ucstring = uc $string;
    print "Uppercase $string is $ucstring\n";
    $ perl uct.pl 
    Uppercase straße is STRASSE
    $
    

    Are you running an old version?


    🦛

      uc suffers from The Unicode Bug when the unicode_strings feature isn't enabled.

      It works correctly (giving SS for ß) when the unicode_strings feature is enabled.

      It works correctly (giving SS for ß) when the string is stored in the UTF8=1 format.

      It works incorrectly (ß unchanged) otherwise.

      use open ":std", ":locale";
      use feature qw( say );

      my $ss = "\xDF";
      utf8::upgrade( my $ss_u = $ss );
      utf8::downgrade( my $ss_d = $ss );

      {
          no feature qw( unicode_strings );
          say uc( $ss_d );  # ß
          say uc( $ss_u );  # SS
      }

      {
          use feature qw( unicode_strings );
          say uc( $ss_d );  # SS
          say uc( $ss_u );  # SS
      }

        Thanks very much, Ikegami. I hadn't heard of that feature.

        How is it possible to get the same behaviour within an XS function?

Re^3: Memory Leak with XS but not pure C
by jo37 (Curate) on Mar 29, 2025 at 14:42 UTC
    the Eszett is a strange character (well, it is German) as it uppercases to 'SS'. This happens with many characters of other languages too. The standard Perl uc just leaves it there when uppercasing.

    I suspect you use uppercase for case-insensitive comparison - which is not correct. foldcase would be the way to go as fc "ß" is indeed "ss".

    Otherwise, maybe uc fc produces your desired result?
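A minimal sketch of the difference, assuming perl 5.16 or later (where fc is available through the fc feature) and a UTF-8 encoded source file. The variable names are only for illustration:

```perl
use strict;
use warnings;
use utf8;                      # the source literals below are UTF-8
use feature qw( fc );          # fc needs perl 5.16+

my $lower = "straße";
my $upper = "STRASSE";

# Case-insensitive comparison via fold case: both sides fold to "strasse".
my $same_fc = fc($lower) eq fc($upper);

# lc is not enough: lc("STRASSE") is "strasse", but lc("straße") keeps the ß.
my $same_lc = lc($lower) eq lc($upper);

# uc of the folded string gives the all-caps form.
my $shout = uc fc $lower;

binmode(STDOUT, ':encoding(UTF-8)');
print "fc equal: ", ($same_fc ? "yes" : "no"), "\n";
print "lc equal: ", ($same_lc ? "yes" : "no"), "\n";
print "uc fc:    $shout\n";
```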

    Greetings,
    🐻

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$
Re^3: Memory Leak with XS but not pure C
by NERDVANA (Priest) on Mar 29, 2025 at 17:55 UTC

    If you were feeding the 'uc' operator a string of UTF-8 bytes from your editor which perl had not been told to treat as Unicode, then perl would apply ASCII uppercasing rules to that string of bytes. Now that you have "use utf8" in your file, I think you'll find that 'uc' works properly on that string. But you'll also find that perl warns you if you try to print that string, because in the default configuration the output streams expect bytes as input. You can either use binmode(STDOUT, ':encoding(UTF-8)') to declare that you intend to always write Unicode to the file handle, or remember to encode the string before printing.
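The two output options can be sketched roughly like this, assuming a UTF-8 encoded source file and a terminal that expects UTF-8. Both print statements should produce the same bytes:

```perl
use strict;
use warnings;
use utf8;
use Encode qw( encode );

my $string   = 'straße';
my $ucstring = uc $string;     # "STRASSE", since "use utf8" decoded the literal

# Option 1: encode explicitly, then print the resulting bytes.
my $bytes = encode('UTF-8', "Uppercase $string is $ucstring\n");
print $bytes;

# Option 2: declare the handle's encoding once and print characters directly.
binmode(STDOUT, ':encoding(UTF-8)');
print "Uppercase $string is $ucstring\n";
```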

    Full Unicode support exists in perl, but yeah, it's kind of a learning curve to find it :-(   But that's the price we pay for full multi-decade back-compat.

      Yeah, you are right. This will be part of a bigger XS thing. Is there a macro I can use for uppercasing?

        Actually I ran into this problem with my Tree::RB::XS module when I wanted to case-fold the keys. The 'uc' operator doesn't have a clean alternative C API available. There are API calls for single characters like 'toUPPER_utf8' but I didn't dig enough to find out if there's a robust way to call this in a loop for all the different versions of perl. The implementation of the uc operator (grep for "pp_uc" in pp.c) has a bunch of ifdef conditionals which have probably changed a lot over the years.

        Since I want to support back to 5.8, I decided to just call out to the perl function with call_pv("CORE::fc", G_SCALAR). But, as the nearby comments mention, before perl 5.16 that wasn't a function, so I needed to wrap the op in a function, sub _fc_impl { lc shift }, and then call that.
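As a rough sketch of the Perl side of such a wrapper (not the actual Tree::RB::XS code; the package name My::FoldCase is hypothetical), the version check could be done once at compile time:

```perl
package My::FoldCase;   # hypothetical package name, for illustration only
use strict;
use warnings;

# Build the wrapper with a string eval so that "CORE::fc" is never
# compiled on perls that predate it (fc was added in 5.16).
BEGIN {
    my $body = $] >= 5.016 ? 'CORE::fc($_[0])' : 'lc($_[0])';
    eval "sub _fc_impl { $body } 1" or die $@;
}

# XS code could then call the wrapper by name, e.g.
#   call_pv("My::FoldCase::_fc_impl", G_SCALAR);
print _fc_impl("HELLO"), "\n";
```

Going through one named wrapper keeps the XS side identical across perl versions, at the cost of a Perl-level sub call on every fold.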

        Since calling perl functions is a decent bit of overhead, if you need this to run in a hot code path you might still be better off with your external unicode library. Or, if you want to avoid that dependency and stick to recent versions of perl you could just copy/paste most of the pp_uc implementation into your own function and call that (but careful with copyrights there).

        And... um... if you get a reasonably robust version made with the perl API, I'd love to improve the performance of Tree::RB::XS :-)