Strings with umlauts and such

PeterKaagman has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

One of the recurring things I need to do each year is create new students in our AD domain. This year it's a bit different because there will be no AD domain anymore, just an Azure AD. So our regular tools do not work anymore, and I thought I'd write my own using Perl and MS Graph to create the users.

Creating the users based on our school information system turned out to be not very hard... just one snag: parents keep coming up with beautiful names containing all sorts of characters which don't fit into ASCII. I know, welcome to the 21st century. You'll have to use Unicode, I guess. Been avoiding that for a long time. Tried to read up on it recently, but it confuses the hell out of me.

Would very much welcome some practical hints on how to go about getting decent names in our AAD.

This is kinda what I would like to do:

For testing purposes I've put some users in an HoH, one of them looking like:

'b232124@ict-atlascollege.nl' => {
            'naam'      => 'Kühne, Jan',
            'v_naam'    => 'Jan',
            'tv'        => '',
            'a_naam'    => 'Kühne',
            'studie'    => '2V6',
            'stam_nr'   => '231224',
            'klas'      => '',
            'locatie'   => 'Copernicus SG'
        }

This resembles the data I would get from the school information system.
To make the HTTP POST I turn this into a payload like this:

# payload maken
my $fullName = $magisterLLN->{$upn}->{'v_naam'};
$fullName   .= " $magisterLLN->{$upn}->{'tv'}" if ($magisterLLN->{$upn}->{'tv'});
$fullName   .= " $magisterLLN->{$upn}->{'a_naam'}";
my $payload = {
            "accountEnabled"    => \1,
            "displayName"       => decode('UTF-8',$fullName),
            "mailNickname"      => "b".$magisterLLN->{$upn}->{'stam_nr'},
            "userPrincipalName" =>  "b".$magisterLLN->{$upn}->{'stam_nr'}.'@ict-atlascollege.nl',
            "passwordProfile"   => {
                "forceChangePasswordNextSignIn" => \1,
                "password"                      => "zeer_geheim_2024"
            },
};

Next I use LWP to POST the JSON-encoded payload to Graph.
In this example I decode the full name using UTF-8, which to my surprise gave me the best result yet, but I still end up with something like "K?hne" in Azure.

Tried encoding and decoding at different stages of the process, but nothing seems to be doing the right thing.

Would appreciate your help a lot.


Replies are listed 'Best First'.
Re: Strings with umlauts and such by hippo (Archbishop) on Aug 13, 2024 at 10:26 UTC
> For testing purposes I've put some users in an HoH

That was a good idea. However, you do not say if you have also written `use utf8;` in your source. Did you do that?

> but I still end up with something like "K?hne" in Azure

Try posting it to something which you control rather than Azure. That way you can determine if the problem is internal to Azure or not. Even just saving it to a local file with a known encoding would do for starters.

Unicode looks big and scary the first time you have to deal with it but it's not so bad in the long run. Good luck.

(edited for typo)

🦛
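The "save it to a local file with a known encoding" suggestion could look roughly like the sketch below. It is only an illustration: the temporary file and the variable names are assumptions, and the ü is written as `\x{FC}` so the result does not depend on how the script itself is saved.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $name = "K\x{FC}hne";        # character string; \x{FC} is ü

# Write through an explicit UTF-8 layer so the on-disk encoding is known.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;
open my $out, '>:encoding(UTF-8)', $path or die "open $path: $!";
print {$out} $name;
close $out or die "close $path: $!";

# Read the raw bytes back, with no decoding layer, and show them in hex.
open my $in, '<:raw', $path or die "open $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;
print "$hex\n";                 # prints: 4b c3 bc 68 6e 65
```

If the dump shows `c3 bc` for the ü, the string was handled as characters and encoded once. Anything else (a lone `fc`, or `c3 83 c2 bc`) points at a problem before the data ever reaches Azure.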
Re^2: Strings with umlauts and such by PeterKaagman (Beadle) on Aug 13, 2024 at 10:49 UTC
When reading up on Unicode I came across something I took as a warning not to use `use utf8`. So I did not. The remark about using UTF-8 in the code did confuse me: using Kühne in a string is using UTF-8 in the code? Must give your suggestion to take Azure out of the equation some thought; not really sure how to go about that.
Re^3: Strings with umlauts and such by hippo (Archbishop) on Aug 13, 2024 at 11:12 UTC
> using Kühne in a string is using utf-8 in the code?

It is, if it is written with utf-8 encoding. `use utf8;` tells perl that your source code includes literal utf-8 characters. If you don't include that but insert literal utf-8 characters in your code, all manner of bad things will ensue. See the utf8 pragma documentation. There is almost no reason not to `use utf8;` in all your code these days.

🦛
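A small sketch of what `use utf8;` changes, assuming the script file itself is saved as UTF-8 (the variable names are only for illustration). The pragma is lexically scoped, so both cases fit in one file:

```perl
use strict;
use warnings;

my ($with, $without);
{
    use utf8;                  # literals in this block are decoded to characters
    $with = 'Kühne';
}
{
    no utf8;                   # literals here stay as the raw bytes from the file
    $without = 'Kühne';
}

# Assuming a UTF-8 encoded source file, ü is one character but two bytes.
printf "with utf8:    length %d\n", length $with;       # 5
printf "without utf8: length %d\n", length $without;    # 6
```

The second string is the kind of value that later gets mangled: Perl sees six separate "characters", two of which are really the halves of one ü.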
Re: Strings with umlauts and such by NERDVANA (Priest) on Aug 15, 2024 at 04:50 UTC
The most important thing to know about Unicode in perl is that you are responsible for remembering whether a particular scalar contains "bytes" or "Unicode characters". Perl does not track this for you. If you lose track of this, you're going to have a confusing time.

So to start, if you write a unicode character in your perl source file, you have "bytes" (which may contain a valid utf-8 sequence from your code editor, but it's still "bytes"). If you put `use utf8;` at the start of your file, then perl understands that you are defining all your strings with "unicode characters".

"Encode" means that you take Unicode characters and convert them to utf-8 byte sequences, i.e. the input is characters, the output is bytes. Decode is the opposite. (You may know that perl stores unicode characters internally as utf-8, but this is an implementation detail and only confuses the issue. Think in terms of characters and bytes.)

When you perform I/O, including "warn" and "print" and "readline", you are reading/writing bytes, unless you put the `binmode $fh, ":encoding(UTF-8)"` layer on that file handle. If you have a string known to be unicode characters, and you want to write it to a file handle without the special encoding layer, you need to "utf8::encode($str)" it yourself before writing.

When you talk to a database, you should make sure that you know whether you are receiving unicode or bytes for any given field. Most database drivers can be configured to correctly identify text columns (and assume them to be unicode) vs. binary columns of bytes.

When you write to a web API, you need to keep track of whether that API expects characters or bytes. It all has to be bytes before it goes over the wire, but sometimes the API does that step for you.

When in doubt, hex-dump the string to find out whether Perl thinks the character is `"\x{100}"` or `"\xC4\x80"`. A handy utility for this is B::perlstring, though it outputs bytes in octal rather than hex.
`use v5.10; use B "perlstring"; my $x = "\x{100}"; say perlstring $x; utf8::encode($x); say perlstring $x;`
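To make the characters-versus-bytes bookkeeping concrete, here is a minimal round trip using the core Encode module. The name is taken from the question; everything else is just illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "K\x{FC}hne";                 # 5 characters; \x{FC} is ü
my $bytes = encode('UTF-8', $chars);      # characters in, bytes out
my $back  = decode('UTF-8', $bytes);      # bytes in, characters out

my $hex = join ' ', map { sprintf '%02x', ord } split //, $bytes;

printf "chars: length %d\n", length $chars;            # 5
printf "bytes: length %d (%s)\n", length $bytes, $hex; # 6 (4b c3 bc 68 6e 65)
printf "round trip ok: %s\n", $back eq $chars ? 'yes' : 'no';

# Printing characters needs an encoding layer on the handle, otherwise
# perl writes them out in its own fallback form.
binmode STDOUT, ':encoding(UTF-8)';
print "$chars\n";
```

The rule of thumb that falls out of this: decode once when data comes in, work with characters in the middle, encode once when data goes out.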
Re: Strings with umlauts and such by PeterKaagman (Beadle) on Aug 22, 2024 at 11:14 UTC
Thanks all for the replies, I think they cleared my mind of some confusion. Just tried to do as you guys suggested: keeping track of bytes versus chars, and using utf-8 in my source. And voilà, things got kinda simple. Told the backend (Graph) to expect UTF-8 and was able to just send the scalar (containing UTF-8 chars). No further encoding or decoding needed. Guess I will have to make sure, when I actually download information from the school information system, that it is in fact UTF-8 chars. Back to assigning groups and other properties to our students... thanks! Peter
Re: Strings with umlauts and such by cavac (Prior) on Aug 14, 2024 at 08:15 UTC
As others have said, local encoding matters. But you also have to make sure that the POST request tells the server which character encoding you are using. I think the default is still ISO-8859-1.

For the downlink portion, the server sets a header like this for UTF-8: `Content-Type: text/html; charset=utf-8`

For a POST request, you also have to set the encoding in the client. Depending on how the content of your POST is structured, you have to adapt this header to your needs: `Content-Type: application/x-www-form-urlencoded; charset=UTF-8`

One easy way to do this is to look at what "Content-Type" header LWP is currently sending, and then adapt it so it explicitly says "UTF-8" for the charset. It is also rather important to make sure that the data you send is, in fact, UTF-8 encoded.
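For a JSON body like the one in the question, the two halves of this advice (UTF-8 bytes on the wire, plus a header that says so) might be combined as in the sketch below. It uses the core JSON::PP module. The `$graph_url` and `$token` names in the comment are hypothetical, and the actual send is left as a comment because it needs a real endpoint and credentials.

```perl
use strict;
use warnings;
use JSON::PP;

my $display_name = "Jan K\x{FC}hne";      # character string; \x{FC} is ü

my $payload = {
    accountEnabled    => JSON::PP::true,
    displayName       => $display_name,
    mailNickname      => 'b231224',
    userPrincipalName => 'b231224@ict-atlascollege.nl',
};

# ->utf8 makes encode() return UTF-8 bytes, ready to go over the wire.
my $body = JSON::PP->new->utf8->canonical->encode($payload);

# Say explicitly which encoding the body uses.
my @headers = ('Content-Type' => 'application/json; charset=utf-8');

# With LWP the send might then look like this (hypothetical names):
#   my $res = LWP::UserAgent->new->post($graph_url, @headers,
#       Authorization => "Bearer $token", Content => $body);

printf "%d bytes to send\n", length $body;
```

Since `$body` is already bytes at this point, nothing downstream should encode it again.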
Re: Strings with umlauts and such by PeterKaagman (Beadle) on Aug 13, 2024 at 10:31 UTC
Not that it really satisfies me... but I did get something working... right after asking. Turned out that it does work when I decode using UTF-8 when I create the request (and remove all the leftover decodes and encodes scattered in my code). Just don't get why this works. Decode seems to me the opposite of what I need.
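One plausible explanation for why the decode helps, sketched under two assumptions: the script has no `use utf8;`, so the 'Kühne' literal is six raw bytes, and the JSON encoder emits UTF-8 (as `encode_json` or `JSON::PP->new->utf8` do). In that situation the encoder treats every byte as a character and encodes the ü a second time, unless the string is decoded first.

```perl
use strict;
use warnings;
use Encode qw(decode);
use JSON::PP;

my $json = JSON::PP->new->utf8;

# What a 'Kühne' literal holds without `use utf8`: six UTF-8 bytes.
my $raw = "K\xC3\xBChne";

# Encoded as-is, each byte is treated as a character and encoded again.
my $double = $json->encode([$raw]);

# Decoded first, the ü is one character and is encoded exactly once.
my $single = $json->encode([ decode('UTF-8', $raw) ]);

printf "without decode: %d bytes\n", length $double;   # 12
printf "with decode:    %d bytes\n", length $single;   # 10
```

So the decode is not the opposite of what is needed: it turns the bytes into characters, and the JSON step then does the one encode that the wire format requires.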