FrankFooty has asked for the wisdom of the Perl Monks concerning the following question:

Oh perl monks,

I'm using the 'libunistring' library in C to uppercase utf8-encoded characters.

In C it works fine, with no memory leaks according to Valgrind. But when I try to do it in XS it works, yet Valgrind complains loudly.

Here's the C:

#include <stdio.h>
#include <unicase.h>
#include <stdlib.h>
#include <unistr.h>
#include <uninorm.h>
#include <string.h>
#include <unistr.h>
#include <unitypes.h>

void main(){
    char* instring = "Hauptstraße 3"; // Works with uint8_t too
    size_t length;
    uint8_t *result;
    size_t input_length;
    input_length=strlen(instring); //u8_strlen works too
    result = u8_toupper (instring, input_length, NULL, UNINORM_NFC, NULL, &length);

The result is as expected: HAUPTSTRASSE 3

my XS function (same includes as above):

SV*
uppercase_utf8_2(SV* sv)
    PREINIT:
        size_t len;
        char* s;
        char* upperstring; //uint8_t also
        SV* aresult;
        size_t upperlength;
    CODE:
        s = SvPVbyte(sv, len); //strings from my editor (locale?) are already utf8 encoded
        upperstring = u8_toupper (s, len, NULL, NULL,NULL, &upperlength);
        upperlength = u8_strlen(upperstring);
        aresult = newSVpv(upperstring,upperlength);
        free(upperstring);
        RETVAL = aresult;
    OUTPUT:
        RETVAL

All hunky-dory except for a tonne of errors from Valgrind, none of which mention the XS sub.

e.g:

==614897== Invalid read of size 1
==614897==    at 0x484F234: strlen (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==614897==    by 0x485B74A: XS_Simple__Ngram_uppercase_utf8_2 (Ngram.xs:79)
==614897==    by 0x2441A9: ??? (in /usr/bin/perl)
==614897==    by 0x2396DD: Perl_runops_standard (in /usr/bin/perl)
==614897==    by 0x1781DA: perl_run (in /usr/bin/perl)
==614897==    by 0x14D639: main (in /usr/bin/perl)
==614897==  Address 0x5bb9282 is 0 bytes after a block of size 18 alloc'd
==614897==    at 0x484DB80: realloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==614897==    by 0x55AEF01: libunistring_u8_casemap (in /usr/lib/x86_64-linux-gnu/libunistring.so.5.0.0)
==614897==    by 0x55AF478: u8_toupper (in /usr/lib/x86_64-linux-gnu/libunistring.so.5.0.0)
==614897==    by 0x485B73F: XS_Simple__Ngram_uppercase_utf8_2 (Ngram.xs:78)
==614897==    by 0x2441A9: ??? (in /usr/bin/perl)
==614897==    by 0x2396DD: Perl_runops_standard (in /usr/bin/perl)
==614897==    by 0x1781DA: perl_run (in /usr/bin/perl)
==614897==    by 0x14D639: main (in /usr/bin/perl)
==614897==
==614897==
==614897== HEAP SUMMARY:
==614897==     in use at exit: 2,544,437 bytes in 9,481 blocks
==614897==   total heap usage: 33,789 allocs, 24,308 frees, 6,454,177 bytes allocated
==614897==
==614897== 2 bytes in 1 blocks are possibly lost in loss record 1 of 1,342
==614897==    at 0x4846828: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==614897==    by 0x24EADB: Perl_sv_magicext (in /usr/bin/perl)
==614897==    by 0x24ED0A: Perl_sv_magic (in /usr/bin/perl)
==614897==    by 0x183F8C: Perl_gv_fetchpvn_flags (in /usr/bin/perl)
==614897==    by 0x174ED2: perl_parse (in /usr/bin/perl)
==614897==    by 0x14D55B: main (in /usr/bin/perl)

Any ideas?

Replies are listed 'Best First'.
Re: Memory Leak with XS but not pure C
by NERDVANA (Priest) on Mar 28, 2025 at 19:05 UTC
    If newSVpv is given a length of zero, it calls strlen on that pointer. You probably want newSVpvn. If your u8 library was given an actual string of length 0, then it may have called malloc with a length of zero, and malloc is permitted to return a magic pointer value that indicates no allocation needs to be freed. That means you can't legally read the first byte of it during strlen, which seems to be what valgrind is complaining about.

    Also, why does your code call u8_strlen when you already have the length?
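
    To illustrate the point about lengths: u8_toupper reports the result length through its last argument, and the Valgrind trace in the question (a strlen read "0 bytes after" the block that u8_toupper allocated) suggests that the returned buffer carries no NUL terminator. Anything that scans for a terminator, such as strlen, u8_strlen, or newSVpv with a length of 0, can then run off the end. Below is a minimal sketch in plain C. The helper counted_to_cstr is a hypothetical name used only for illustration and is not part of libunistring or the Perl API. It copies a buffer by its explicit length, which is the same idea as handing newSVpvn the length you already have.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of libunistring or the Perl API):
 * copy a length-counted buffer into a fresh NUL-terminated C string.
 * Only the first len bytes of buf are read, so buf does not need a
 * terminator of its own. Returns NULL if allocation fails. */
char *counted_to_cstr(const uint8_t *buf, size_t len)
{
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    if (len > 0)
        memcpy(out, buf, len);   /* never reads past buf[len - 1] */
    out[len] = '\0';
    return out;
}
```

    The caller owns the returned string and releases it with free().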

    For the rest (33K mallocs, 24K frees), I'm guessing perl doesn't bother to deeply free every data structure as it exits, since that would just waste time when the OS will clean it up anyway.

    Update

    Also, I suspect you should be using SvPVutf8 instead of SvPVbyte. The comment that "strings from my editor are already utf8 encoded" probably means that perl sees your strings as a string of bytes which *happen* to be utf-8 byte sequences, and that probably isn't what you want. Unicode handling in Perl can get very confusing because Perl requires the programmer to keep track of which strings are bytes and which are unicode, and also keep track of which APIs expect to receive bytes or strings of unicode. If you want to type unicode string literals and have perl understand them as unicode text, you should declare "use utf8;" at the top of your script. Otherwise you have declared an array of bytes, and if you pass that to an API expecting unicode, your strings could get double-encoded.
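
    The double-encoding problem can be sketched in plain C. The helper below, latin1_to_utf8, is a hypothetical name for illustration and is not a Perl API function. It encodes each input byte as if it were one Latin-1 character, which is roughly what happens when a byte string that already holds UTF-8 is treated as characters and encoded again.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: treat every input byte as one Latin-1 character
 * and write its UTF-8 encoding to out. out must have room for 2 * len
 * bytes. Returns the number of bytes written. */
size_t latin1_to_utf8(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                            /* ASCII: unchanged */
        } else {
            out[n++] = (uint8_t)(0xC0 | (in[i] >> 6));   /* lead byte: 0xC2 or 0xC3 */
            out[n++] = (uint8_t)(0x80 | (in[i] & 0x3F)); /* continuation byte */
        }
    }
    return n;
}
```

    Given the single Latin-1 byte 0xDF (ß), this produces the correct UTF-8 pair C3 9F. Given those two bytes a second time, it produces C3 83 C2 9F, which decodes to "Ã" followed by a control character rather than "ß".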

    Back to your XS method, using SvPVbyte means that your XS library API needs to document that it operates on "byte strings which are expected to be a valid utf-8 encoding". Maybe this is what you want? but it will crash if a user passes it perl's understanding of unicode, e.g. uppercase_utf8_2("\x{100}").

    And finally, I'm curious how this library is an improvement over perl's own 'uc' operator. Does perl incorrectly handle some cases?

      This looks like a weird situation. The ß is kind of funky "s" and there is no uppercase version of this single lowercase letter. It is normally translated to 2 upper case "S" symbols. straße => STRASSE. This creates a pronunciation exception in German. The "a" preceding the "s" is pronounced differently depending upon whether one or two consonants follow it. This is a weird thing, but the string gets longer when capitalized. I am not sure what uc() does. Anyway was thinking that this has something to do with more memory being allocated and perhaps lost.
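
      A small sketch in plain C separates "longer in characters" from "longer in bytes". The function utf8_char_count is a hypothetical helper for illustration, not part of libunistring, and it assumes valid UTF-8 input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count characters (code points) in a UTF-8 buffer
 * by counting every byte that is not a continuation byte (10xxxxxx).
 * Assumes the input is valid UTF-8. */
size_t utf8_char_count(const uint8_t *s, size_t nbytes)
{
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

      For "straße" encoded as UTF-8 this gives 6 characters in 7 bytes, because ß occupies the two bytes C3 9F. For "STRASSE" it gives 7 characters in 7 bytes. So this particular mapping adds a character without adding any bytes. Other case mappings can change the byte length, which is why a routine like u8_toupper reports the result length separately.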

        ”… there is no uppercase version of this single lowercase letter…”

        Unicode Character “ẞ” (U+1E9E) - Latin Capital Letter Sharp S

        Since 6/24/2008 in ISO/IEC 10646

        The ß is kind of funky "s"

        It is actually a ligature of s and z, or at least, it started as one. That also gave it its name, Eszett: s-z. It is way more obvious in Fraktur, where you have two different forms of the lower case s. The "short" s that looks familiar and is generally used at the end of syllables, and the long s that is generally used at the beginning or in the middle of a syllable. It looks more or less like an f without the horizontal line. For the sharp s (which is also an alternative name for the Eszett), the s was doubled, depending on time and font, either as two long s or two short s or a long and a short s. The combination of long s and short s was alternatively printed as long s and z, which were merged in a ligature. In the following years, ß and ss became slightly different, annoying generations of students. The 1996 orthography reform attempted to get rid of ß in many places.

        and there is no uppercase version of this single lowercase letter.

        There are reasons: The upper case s was always S, for both long s and short s. The sharp s, written as ss (two longs, two shorts, or one long and one short) would always be written in upper case as SS. No extra rules or letters needed. The alternative form sz, printed as ligature of long s and z, would be written in upper case as SZ. Again, no extra rules or letters needed. But then, people started to treat the s-z ligature as a new and unique letter and forgot that it was a ligature. That caused the "strange" rule of "converting" ß to SS when converting to upper case, except where misunderstandings may happen, in that case, ß should be "converted" to what it represents, SZ. That rule is rarely used, most times, context is sufficient. Maße (measurements) and Masse (mass) are a classic example, both can be written as MASSE, but if misunderstandings may happen, Maße must be written as MASZE.

        At this point, rules for converting to upper case become really hard for computers. And so, ß was finally treated as a regular letter instead of a ligature and got its own dedicated upper case form (see Re^3: Memory Leak with XS but not pure C). The allocation in Unicode is a little bit far away from ß and the other glyphs used in German, and keyboard support sucks (Shift-ß gives ?, not the upper case ß), but at least there is an upper case ß, now that the new orthography tried to eliminate it.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Yes Marshall, German is a great language eh?

      Thanks Nerdvana for your fast and very helpful reply!

      Using newSVpvn in place of newSVpv does indeed solve the complaint from Valgrind. Also you were of course correct about the extraneous u8_strlen

      I'm the epitome of confusedness regarding Perl and unicode. Your tip to add "use utf8" was very useful as I'm passing the literal strings, and I added a check to the XS code prior to getting the string from the SV:

      if(!SvUTF8(sv)) {
          sv = sv_mortalcopy(sv); /* work on a copy so the caller's SV is not modified */
          sv_utf8_upgrade(sv);
      }
      s = SvPVutf8(sv, len);

      As Marshall says below, the Eszett is a strange character (well, it is German) as it uppercases to 'SS'. This happens with many characters of other languages too. The standard Perl uc just leaves it there when uppercasing. The libunistring library (not mine!) does it correctly.

      Thanks again for the help!

        The standard Perl uc just leaves it there when uppercasing.

        Any currently supported perl should uppercase it correctly:

        $ cat uct.pl 
        #!/usr/bin/env perl
        use strict;
        use warnings;
        use utf8;
        
        my $string = 'straße';
        my $ucstring = uc $string;
        print "Uppercase $string is $ucstring\n";
        $ perl uct.pl 
        Uppercase straße is STRASSE
        $
        

        Are you running an old version?


        🦛

        the Eszett is a strange character (well, it is German) as it uppercases to 'SS'. This happens with many characters of other languages too. The standard Perl uc just leaves it there when uppercasing.

        I suspect you use uppercase for case-insensitive comparison - which is not correct. foldcase would be the way to go as fc "ß" is indeed "ss".

        Otherwise, maybe uc fc produces your desired result?

        Greetings,
        🐻

        $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

        If you were feeding the 'uc' operator a string of utf8 bytes from your editor which perl had not been informed was intended as unicode, then perl would apply ascii uppercasing rules to that string of bytes. Now that you have the "use utf8" in your file, I think you'll find that 'uc' works properly on that string. But you'll also find that perl warns you if you try to print that string, because in the default configuration the output streams expect bytes as input. You can either use binmode(STDOUT, ':encoding(UTF-8)') to declare that you intend to always write unicode to the file handle, or remember to encode the string before printing.

        Full unicode support exists in perl, but yeah it's kind of a learning curve to find it :-(   But that's the price we pay for full multi-decade back-compat.

Re: Memory Leak with XS but not pure C
by bliako (Abbot) on Mar 28, 2025 at 19:47 UTC

    I have no idea if your code is leaky but Perl runs its own memory pool and garbage collection and this causes valgrind to show lots of memory not free'ed at the end (as NERDVANA also said). There is a way to tell perl to free all memory at the end:

    PERL_DESTRUCT_LEVEL

    If you want to run any of the tests yourself manually using e.g. valgrind, please note that by default perl does not explicitly clean up all the memory it has allocated (such as global memory arenas) but instead lets the exit() of the whole program "take care" of such allocations, also known as "global destruction of objects".

    There is a way to tell perl to do complete cleanup: set the environment variable PERL_DESTRUCT_LEVEL to a non-zero value. The t/TEST wrapper does set this to 2, and this is what you need to do too, if you don't want to see the "global leaks": For example, for running under valgrind

        env PERL_DESTRUCT_LEVEL=2 valgrind ./perl -Ilib t/foo/bar.t

    (see PERL_DESTRUCT_LEVEL)

    There are lots of situations where valgrind complains in this way, e.g. with Qt or Xlib. "Valgrind suppression files" are used to tell valgrind what not to report for each of these situations. I am not sure if there is one for perl but perhaps Test::Valgrind can do this and without messing with those pesky command line switches.

    Finally, I would create the simplest XS code which does nothing at all and pass that through valgrind in order to confirm that what I said indeed applies and how to suppress/filter-in the good stuff.

      Thanks bliako

      You are indeed correct. Thanks for the tip