Re: Perl strings questions

Corion and choroba have given good advice, so just some extras:

my $unicode_string = "αβγαabc123"; # unicode chars mixed with lower-ascii

This is not a unicode string. (choroba notes below that it originally was, but PerlMonks mangles non-ASCII characters within code blocks.) It is an HTML encoding of a unicode string in plain ASCII. To get a unicode string from this, do:

use HTML::Entities;
my $html_string = "&#945;&#946;&#947;&#945;abc123";
my $unicode_string = decode_entities($html_string);

2) Also, I have a question about why I need to encode in "UTF-8". Does that make sure that the "double-bytes" and possible "single-bytes" are all becoming a stream of "single-bytes"?

After UTF-8 encoding, you end up with a stream of bytes. Each character is represented by one to four bytes, and ASCII characters by exactly one.
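
A minimal sketch of the difference between characters and bytes, assuming the core Encode module. The string is the one from the question, written with `\x{...}` escapes so that it survives a code block; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# alpha, beta, gamma, alpha, then six ASCII characters
my $text = "\x{3b1}\x{3b2}\x{3b3}\x{3b1}abc123";

# Encoding turns the character string into a byte string.
my $bytes = encode('UTF-8', $text);

# length() counts characters in $text, but bytes in $bytes:
# each Greek letter needs two bytes, each ASCII character one.
my $char_count = length $text;
my $byte_count = length $bytes;

# Decoding reverses the encoding.
my $round_trip = decode('UTF-8', $bytes);

printf "%d characters, %d bytes\n", $char_count, $byte_count;
```

This prints `10 characters, 14 bytes`: four Greek letters at two bytes each plus six ASCII characters at one byte each.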

However, you do not strictly need to encode in UTF-8. The text encoding is an agreement between the sender and the receiver. This agreement can be made by explicit specification (example: Content-Type: text/html; charset=utf-8), by some standard or default (examples: XML defaults to UTF-8, HTML defaults to ISO-8859-1), or just by the developers talking to each other over a beer (not recommended). These days, UTF-8 is highly recommended because it is able to represent any unicode character in a consistent way.
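
To illustrate why the agreement matters, here is a small sketch (again assuming the core Encode module, with made-up variable names) of what happens when the receiver decodes with a different encoding than the sender used:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The sender encodes a single greek alpha as UTF-8: two bytes, CE B1.
my $sent = encode('UTF-8', "\x{3b1}");

# A receiver who honours the agreement gets the character back.
my $agreed = decode('UTF-8', $sent);

# A receiver who assumes ISO-8859-1 sees two unrelated characters,
# U+00CE and U+00B1, because every byte maps to one character.
my $mismatched = decode('ISO-8859-1', $sent);

printf "agreed: %d character, mismatched: %d characters\n",
    length $agreed, length $mismatched;
```

Neither decode call fails. The mismatch only shows up as wrong text, which is why the encoding should be stated explicitly rather than guessed.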

I am trying to find the safe way to do things when strings are mixed,

My recommendation: Just don't do that. Text and binary data don't mix well in a simple string. Finding the borders is hard to do in a safe way since binary data may occasionally look like text.

Re^2: Perl strings questions by choroba (Cardinal) on Jun 02, 2021 at 11:10 UTC
> This is not a unicode string

It was, but PerlMonks can't display it in a `<code>` block :-(
Re^3: Perl strings questions by haj (Vicar) on Jun 02, 2021 at 11:23 UTC
Thanks for the info! I've added it to my article (a `use utf8;` would then be appropriate). So when I write αβγαabc123 inline, it displays as αβγαabc123, and when I write the same characters in a code block, they display as `&#945;&#946;&#947;&#945;abc123`. I didn't know that.
Re^2: Perl strings questions by bliako (Abbot) on Jun 02, 2021 at 12:29 UTC
I have updated my post re: mixing strings.
Re^3: Perl strings questions by haj (Vicar) on Jun 02, 2021 at 14:25 UTC
Thanks, that clarifies some things. Yet, the python code does not mix text and binary. As far as I read the code, the binary stuff is BASE64 encoded.

Well, yes, unfortunately "encoding" is used for a lot of things. Let me try to explain the difference:

UTF-8 is an encoding to map unicode characters to bytes. Unicode characters are identified by their code point. For ASCII characters and control characters like `LINE FEED`, the code point is equal to the "traditional" byte value, and also to the UTF-8 mapping. Perl's interface identifies characters by their code point, so you get a lowercase greek alpha by `chr(945)` or by `"\x{3b1}"`. You can also use the names as in choroba's example: `"\N{GREEK SMALL LETTER ALPHA}"`.

BASE64 is an encoding to map a stream of bytes, each of which is in the range 0..255 ("binary data"), to a stream of bytes, each of which represents an ASCII character. The result happens to be valid UTF-8 (see above).

Binary data will in most cases contain bytes in the range 128..255. Their UTF-8 encoding is not equal to their byte value. If you encode such bytes in UTF-8, it is like Perl interpreting their byte values as code points: Unicode has code points in that range with (not so) surprising similarity to ISO-8859-1. The code point for ö is `U+00F6`, but its UTF-8 encoding has two bytes, `X'C3B6'`.

So if you encode binary data in UTF-8, the result differs from the input, but the process is deterministic and reversible. However, it depends on the receiving side to decode a UTF-8 stream into binary data and not into a unicode string. Perl happens to do that (because, as you wrote, it makes no difference), but not many other languages do. In general, you cannot decode a UTF-8 stream into binary if it contains one or more characters with a code point greater than 255.
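
The points above can be checked with a short sketch. It assumes the core modules Encode and MIME::Base64; the sample bytes and variable names are made up for illustration.

```perl
use strict;
use warnings;
use charnames ':full';          # only needed for \N{...} on Perls before 5.16
use Encode qw(encode decode);
use MIME::Base64 qw(encode_base64);

# Three spellings of the same character, U+03B1.
my $alpha_chr  = chr(945);
my $alpha_hex  = "\x{3b1}";
my $alpha_name = "\N{GREEK SMALL LETTER ALPHA}";

# The code point of o-umlaut is U+00F6, its UTF-8 encoding is C3 B6.
my $o_umlaut_hex = uc(unpack('H*', encode('UTF-8', chr(0xF6))));

# Some arbitrary binary data, three of the five bytes are >= 128.
my $binary = pack('C*', 0x00, 0x2B, 0xBD, 0xD0, 0xFF);

# BASE64 maps the bytes to ASCII characters only.
my $base64   = encode_base64($binary, '');   # '' suppresses the newline
my $is_ascii = $base64 !~ /[^\x00-\x7F]/ ? 1 : 0;

# UTF-8 encoding treats each byte value as a code point, so every
# byte >= 128 becomes two bytes. Decoding restores the original.
my $as_utf8  = encode('UTF-8', $binary);
my $restored = decode('UTF-8', $as_utf8);

printf "o-umlaut as UTF-8: %s\n", $o_umlaut_hex;
printf "binary: %d bytes, as UTF-8: %d bytes, round trip %s\n",
    length $binary, length $as_utf8,
    $restored eq $binary ? 'ok' : 'failed';
```

The round trip only works here because every code point in `$restored` is below 256. As soon as the decoded string contains a character like the alpha above, there is no byte it could map back to.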
Re^4: Perl strings questions by bliako (Abbot) on Jun 02, 2021 at 17:32 UTC
Re: `python does not mix binary and string`, well I thought this `hashlib.sha256(encoded).digest()` was a binary hash. I am positive that I printed it to see, but right now I have no time, so I will update tomorrow. The rest is very useful and I will read it tomorrow.

UPDATE: the digest() prints out `b'+\xbd\xd0Z\xda;\x05\xbb\x80\x058(.' etc`
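
For comparison, a rough Perl counterpart of that `digest()` call, assuming the core Digest::SHA module. The input string is a made-up placeholder, since the original Python input is not shown in this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Digest::SHA qw(sha256 sha256_hex);

# Hypothetical input: a character string, encoded to bytes before hashing.
my $encoded = encode('UTF-8', "\x{3b1}\x{3b2}\x{3b3}abc");

# sha256() returns the digest as 32 raw bytes (binary data),
# sha256_hex() returns the same digest as 64 ASCII hex digits.
my $raw = sha256($encoded);
my $hex = sha256_hex($encoded);

printf "raw: %d bytes, hex: %d characters\n", length $raw, length $hex;
```

The raw form is binary data and should be kept apart from text. The hex form (or a BASE64 form) is plain ASCII and can safely be embedded in a text string.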