in reply to Memory Leak with XS but not pure C

If newSVpv is given a length of zero, it calls strlen on that pointer. You probably want newSVpvn, which always uses the length you pass. If your u8 library was given an actual string of length 0, it may have called malloc with a length of zero, and malloc is permitted to return a special pointer value that must not be dereferenced (it can only be passed to free). That means you can't legally read the first byte of it during strlen, which seems to be what valgrind is complaining about.

Also, why does your code call u8_strlen when you already have the length?

For the rest (33K mallocs, 24K frees), I'm guessing perl doesn't bother to deeply free every data structure as it exits, since that would just waste time when the OS will clean it up anyway.

Update

Also, I suspect you should be using SvPVutf8 instead of SvPVbyte. The comment that "strings from my editor are already utf8 encoded" probably means that perl sees your strings as a string of bytes which *happen* to be utf-8 byte sequences, and that probably isn't what you want. Unicode handling in Perl can get very confusing because Perl requires the programmer to keep track of which strings are bytes and which are unicode, and also keep track of which APIs expect to receive bytes or strings of unicode. If you want to type unicode string literals and have perl understand them as unicode text, you should declare "use utf8;" at the top of your script. Otherwise you have declared an array of bytes, and if you pass that to an API expecting unicode, your strings could get double-encoded.
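To make the bytes-versus-characters distinction concrete, here is a minimal sketch using the core Encode module. The variable names are made up for illustration, and the byte string stands in for what a literal holds when "use utf8" is absent.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The UTF-8 bytes an editor would save for "straße" (ß is 0xC3 0x9F).
# Without "use utf8", a string literal holds exactly these bytes.
my $bytes = "stra\xC3\x9Fe";

# Decoding turns the byte sequence into a string of characters.
my $chars = decode('UTF-8', $bytes);

# Encoding the character string once gives back the original bytes.
my $encoded_once = encode('UTF-8', $chars);

# Encoding the undecoded bytes treats each byte as a Latin-1 character,
# so the two bytes of ß become four: the double-encoding problem.
my $encoded_twice = encode('UTF-8', $bytes);

printf "bytes: %d, characters: %d, double-encoded: %d\n",
    length($bytes), length($chars), length($encoded_twice);
```

The three lengths differ (7, 6 and 9), which is usually the quickest way to tell which kind of string you are holding.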

Back to your XS method, using SvPVbyte means that your XS library API needs to document that it operates on "byte strings which are expected to be a valid utf-8 encoding". Maybe this is what you want, but it will die with a "Wide character" error if a user passes it a string that perl already understands as unicode and that contains code points above 0xFF, e.g. uppercase_utf8_2("\x{100}").
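The same failure can be reproduced from pure Perl, as a rough analogue of what SvPVbyte does internally. This is only a sketch: fits_in_bytes is a hypothetical helper, and utf8::downgrade stands in for the XS-level downgrade.

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character fits in one byte (<= 0xFF).
sub fits_in_bytes {
    my ($string) = @_;
    my $copy = $string;                 # downgrade works in place, so use a copy
    return utf8::downgrade($copy, 1) ? 1 : 0;   # FAIL_OK = 1: return false, don't die
}

my $latin1_ok = fits_in_bytes("\x{DF}");    # ß fits in a byte
my $wide_ok   = fits_in_bytes("\x{100}");   # Ā does not

# Without FAIL_OK the failed downgrade throws, much like the XS call would.
my $error = '';
my $wide  = "\x{100}";
eval { utf8::downgrade($wide); 1 } or $error = $@;
print "latin1: $latin1_ok, wide: $wide_ok, error: $error";
```

A check like this in the Perl wrapper would let the module report a friendlier error before the string ever reaches the XS code.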

And finally, I'm curious how this library is an improvement over perl's own 'uc' operator. Does perl incorrectly handle some cases?

Re^2: Memory Leak with XS but not pure C
by Marshall (Canon) on Mar 29, 2025 at 04:53 UTC
    This looks like a weird situation. The ß is kind of funky "s" and there is no uppercase version of this single lowercase letter. It is normally translated to 2 upper case "S" symbols. straße => STRASSE. This creates a pronunciation exception in German. The "a" preceding the "s" is pronounced differently depending upon whether one or two consonants follow it. This is a weird thing, but the string gets longer when capitalized. I am not sure what uc() does. Anyway was thinking that this has something to do with more memory being allocated and perhaps lost.

      “… there is no uppercase version of this single lowercase letter…”

      Unicode Character “ẞ” (U+1E9E) - Latin Capital Letter Sharp S

      Since 6/24/2008 in ISO/IEC 10646

        This letter was created in 2008 and standardized in the German language in 2017, but usage is optional, with "SS" the standard for uppercasing ß. That makes the new letter the only letter of the German language that is not available on standard German keyboards. Great. Just great. Another perfect use of my taxpayer money.

        And no, that special letter is currently not fully supported in my commercial software either, because the font I use for printing invoices on thermal paper doesn't support it.¹

        Oh well, that's the German language. To misquote Kennedy: "We speak German, not because it is easy, but because it is hard. Because that challenge is one that we are forced to accept, one we are unable to postpone, and one we intend to fail at miserably."

        And we do. Only a fraction of native German speakers actually speak German. Most (including me) speak a dialect of German. Especially in Austria, my home. When people from different Austrian states meet, it is an awesome thing to listen in. Everybody speaks a completely different dialect, and somehow we mostly manage to understand one another. (If someone joins who has only learned German as a second language, they might be in for a truly baffling experience, though.)

        Sidenote: I have seen a few Austrian movies played on German TV stations with German subtitles (notably: "Hinterholz 8"), which was one of the funniest experiences ever.


        ¹ It's astonishingly hard to find a readable, modern fixed-width font that looks good and can be scaled down well enough that you can print all required text on an invoice, when you only have 512 pixels in width, on paper that's only 80mm (3.1 inch) wide. And that is still readable if you scale the image down to 384 pixels (50mm / 1.9 inch) for printing on a mobile Bluetooth printer. Searching for a font that can do all that and that supports some special letter that nobody uses anyway is an exercise for another decade...

      The ß is kind of funky "s"

      It is actually a ligature of s and z, or at least, it started as one. That also gave it its name, Eszett: s-z. It is way more obvious in Fraktur, where you have two different forms of the lower case s. The "short" s that looks familiar and is generally used at the end of syllables, and the long s that is generally used at the beginning or in the middle of a syllable. It looks more or less like an f without the horizontal line. For the sharp s (which is also an alternative name for the Eszett), the s was doubled, depending on time and font, either as two long s or two short s or a long and a short s. The combination of long s and short s was alternatively printed as long s and z, which were merged in a ligature. In the following years, ß and ss became slightly different, annoying generations of students. The 1996 orthography reform attempted to get rid of ß in many places.

      and there is no uppercase version of this single lowercase letter.

      There are reasons: The upper case s was always S, for both long s and short s. The sharp s, written as ss (two longs, two shorts, or one long and one short) would always be written in upper case as SS. No extra rules or letters needed. The alternative form sz, printed as ligature of long s and z, would be written in upper case as SZ. Again, no extra rules or letters needed. But then, people started to treat the s-z ligature as a new and unique letter and forgot that it was a ligature. That caused the "strange" rule of "converting" ß to SS when converting to upper case, except where misunderstandings may happen, in that case, ß should be "converted" to what it represents, SZ. That rule is rarely used, most times, context is sufficient. Maße (measurements) and Masse (mass) are a classic example, both can be written as MASSE, but if misunderstandings may happen, Maße must be written as MASZE.

      At this point, rules for converting to upper case become really hard for computers. And so, ß was finally treated as a regular letter instead of a ligature and got its own dedicated upper case form (see Re^3: Memory Leak with XS but not pure C). The allocation in Unicode is a little bit far away from ß and the other glyphs used in German, keyboard support sucks (Shift-ß gives ?, not the upper case ß), but at least there is an upper case ß, now that the new orthography tried to eliminate it.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Die Verlegenheit hat Erik Spiekermann beschrieben, der Grandseigneur unter den deutschen Schriftgestaltern: „Ich mag die Idee eines großen ß, aber ich habe noch nirgendwo eine überzeugende Form gesehen.“ Das dürfte der Grund dafür sein, dass bislang nur sehr wenige der gängigen Schriftarten überhaupt über einen ß-Großbuchstaben verfügen. In der Regel sitzt man nämlich vor seiner Tastatur, tippt Shift, AltGr und ß und sieht: nichts. Man erinnere sich dann an die letzte Scrabble-Partie und an Friedrich Forssman: „Tiefes Lesen geht nur, wenn der Text unsichtbar ist.“

        Erik Spiekermann, the grand seigneur among German typeface designers, has described the embarrassment: "I like the idea of a capital ß, but I haven't seen a convincing form anywhere." This is probably the reason why only very few of the current fonts have a capital ß at all. As a rule, you sit in front of your keyboard, type Shift, AltGr and ß and see: nothing. Remember the last game of Scrabble and Friedrich Forssman: "Deep reading is only possible if the text is invisible."

        Buchstabe ẞ: Formprobleme der deutschen Sprache

        Erik Spiekermann

      Yes Marshall, German is a great language eh?

        German is a great language eh?

        Sure it is. There are so many crazy rules to learn that it is only beaten by the complete mismatch of language and spelling in English, the ridiculous amount of completely silent extra letters at the end of French words, and the number of inflection rules in Latin. It's so hard that even native speakers can have a hard time using it properly.

        Examples?

        Refer to a young girl as "Mädchen". That's the diminutive form of "Magd" (maid), but that is generally long forgotten. You can still see it by the "-chen" suffix. Because it is a diminutive, the grammatical gender changes from feminine to neuter. That's just grammatical, no implications about biology, social, cultural gender. And so, if you want to refer to that "Mädchen" in the next sentence, you must use the neuter pronoun "es", not the feminine pronoun "sie". If you use "sie", you are doing it wrong. That error is quite common, even for native speakers, even for professional speakers (like the presenters of the Tagesschau).

        Comparing criteria. If the amount is the same, you use "wie": "A hat genau so viele Äpfel wie B". If the amount is less or more, you use "als": "A hat mehr Orangen als B", "B hat weniger Orangen als A". Same as in English: "A has as many apples as B", "A has more oranges than B", "B has fewer oranges than A". But getting "wie" and "als" right is hard, because of regional differences. Many native speakers can't get their head around using "als" when comparing. They always use "wie", and that error increasingly happens to professional speakers, too.

        Grinding vs. looping. Grinding, reducing the thickness of some material by abrasive tools, is "schleifen", past tense form "geschliffen". Grinding to cut material, intentionally or not, is "durchschleifen", past tense "durchgeschliffen" (strong inflection). Tie a cable across a rough, spinning wheel, add some sand and water, and the cable will be cut through in no time. The cable is "durchgeschliffen". A loop is "Schleife". Forming a loop, especially when handling electrical signals, e.g. into one device, then out of that device and into the next device in a chain, is "durchschleifen". Same letters and same sound as the grinding process, but a completely different base and a completely different meaning. Past tense is "durchgeschleift" (weak inflection). You can still see the "Schleife" in that word. Professionals started to intentionally use the wrong conjugation "durchgeschliffen" for fun, and many other people picked up that wrong form, not even knowing about the loop. Professional speakers rarely get that one wrong.

        Alexander

Re^2: Memory Leak with XS but not pure C
by FrankFooty (Novice) on Mar 29, 2025 at 12:57 UTC

    Thanks Nerdvana for your fast and very helpful reply!

    Using newSVpvn in place of newSVpv does indeed solve the complaint from Valgrind. Also, you were of course correct about the extraneous u8_strlen.

    I'm the epitome of confusedness regarding Perl and unicode. Your tip to add "use utf8" was very useful as I'm passing the literal strings, and I added a check to the XS code prior to getting the string from the SV:

    if (!SvUTF8(sv)) {
        sv = sv_mortalcopy(sv);  /* work on a copy so the caller's scalar is untouched */
        sv_utf8_upgrade(sv);
    }
    s = SvPVutf8(sv, len);

    As Marshall says below, the esszett is a strange character (well, it is German) as it uppercases to 'SS'. This happens with many characters of other languages too. The standard Perl uc just leaves it there when uppercasing. The libunistring library (not mine!) does it correctly.

    Thanks again for the help!

      The standard Perl uc just leaves it there when uppercasing.

      Any currently supported perl should uppercase it correctly:

      $ cat uct.pl 
      #!/usr/bin/env perl
      use strict;
      use warnings;
      use utf8;
      
      my $string = 'straße';
      my $ucstring = uc $string;
      print "Uppercase $string is $ucstring\n";
      $ perl uct.pl 
      Uppercase straße is STRASSE
      $
      

      Are you running an old version?


      🦛

        uc suffers from The Unicode Bug when the unicode_strings feature isn't enabled.

        It works correctly (giving SS for ß) when the unicode_strings feature is enabled.

        It works correctly (giving SS for ß) when the string is stored in the UTF8=1 format.

        It works incorrectly (ß unchanged) otherwise.

        use open ":std", ":locale";
        use feature qw( say );

        my $ss = "\xDF";
        utf8::upgrade(   my $ss_u = $ss );
        utf8::downgrade( my $ss_d = $ss );

        {
           no feature qw( unicode_strings );
           say uc( $ss_d );  # ß
           say uc( $ss_u );  # SS
        }

        {
           use feature qw( unicode_strings );
           say uc( $ss_d );  # SS
           say uc( $ss_u );  # SS
        }

      the esszett is a strange character (well, it is German) as it uppercases to 'SS'. This happens with many characters of other languages too. The standard Perl uc just leaves it there when uppercasing.

      I suspect you use uppercase for case-insensitive comparison - which is not correct. foldcase would be the way to go as fc "ß" is indeed "ss".

      Otherwise, maybe uc fc produces your desired result?
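A minimal sketch of the fold-case suggestion, assuming perl 5.16 or later (where the fc feature is available). same_ignoring_case is a hypothetical helper name, not something from the original code.

```perl
use strict;
use warnings;
use utf8;                            # the literals below are Unicode text
use feature qw(fc unicode_strings);

# Hypothetical helper: case-insensitive equality via full case folding.
sub same_ignoring_case {
    my ($left, $right) = @_;
    return fc($left) eq fc($right);
}

my $folded   = fc('Straße');         # ß folds to "ss"
my $shouted  = uc fc 'Straße';       # fold first, then uppercase
my $is_equal = same_ignoring_case('Straße', 'STRASSE');

binmode(STDOUT, ':encoding(UTF-8)');
print "$folded $shouted ", ($is_equal ? 'equal' : 'different'), "\n";
```

Folding both sides before comparing avoids having to decide whether ß should become SS or ẞ, which is the point of using fc rather than uc for comparisons.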

      Greetings,
      🐻


      If you were feeding the 'uc' operator a string of utf8 bytes from your editor which perl had not been informed was intended as unicode, then perl would apply ascii uppercasing rules to that string of bytes. Now that you have the "use utf8" in your file, I think you'll find that 'uc' works properly on that string. But you'll also find that perl warns you ("Wide character in print") if you try to print a string containing code points above 0xFF, because in the default configuration the output streams expect bytes as input. You can either use binmode(STDOUT, ':encoding(UTF-8)') to declare that you intend to always write unicode to the file handle, or remember to encode the string before printing.
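The two output options can be sketched as follows. This is an illustration only: the variable names are invented, and it assumes the script file itself is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                        # source literals are Unicode text
use Encode qw(encode);

my $upper = uc 'straße';         # a character string; uc expands ß to SS
my $name  = "\x{100}";           # Ā, a code point above 0xFF

# Option 1: declare the handle's encoding once; every print is encoded for you.
binmode(STDOUT, ':encoding(UTF-8)');
print "$upper $name\n";

# Option 2: encode explicitly, producing bytes suitable for a raw handle.
# (Do not print these to STDOUT here: the layer above would encode them again.)
my $bytes = encode('UTF-8', "$upper $name\n");
```

Option 1 is usually less error-prone for a script's own output. Option 2 fits better when the bytes go to an API that expects already-encoded data.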

      Full unicode support exists in perl, but yeah it's kind of a learning curve to find it :-(   But that's the price we pay for full multi-decade back-compat.

        Yeah, you are right. This will be part of a bigger XS thing. Is there a macro I can use for uppercasing?