in reply to Safe string handling
Dealing with data that comes from webpages can be really complicated. There is likely to be a combination of ASCII, UTF-8, and wide characters in the data returned.
ASCII is a subset of valid UTF-8, so you cannot have a combination of UTF-8 and ASCII in a string; you just have UTF-8. "Wide characters" is ambiguous here. It seems to mean broken or unknown bytes that are putative character data. That doesn't happen much in the wild anymore; when it does, you see pages littered with �s. So, I don't think that this situation is "likely." I can't think of the last time I saw it.
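A quick sketch of both points using the core Encode module (the byte values here are my own illustrations, not from the thread): pure ASCII decodes as UTF-8 unchanged, and a malformed byte decodes, by default, to the U+FFFD replacement character you see littering broken pages.

```perl
use strict;
use warnings;
use Encode qw(decode);

# ASCII is a subset of UTF-8, so decoding plain ASCII is a no-op:
my $ascii = decode('UTF-8', 'plain ASCII text');
# $ascii eq 'plain ASCII text'

# "\xC3" alone is a truncated UTF-8 sequence. Encode's default
# check mode substitutes the replacement character U+FFFD:
my $mangled = decode('UTF-8', "caf\xC3");
# $mangled eq "caf\x{FFFD}"  -- rendered as "caf�"
```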
This "Hello\x{26c4}".encode("utf-8","\x{26f0}")."\x{10102}\x{2fa1b}" is broken on purpose (a concatenation of a Perl character string, UTF-8 encoded bytes, and another Perl character string). This can only happen through incorrect handling of character data encodings which, I assert, is fairly uncommon on the web today.
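To make the failure mode concrete, here is a minimal sketch (my own example, not from the thread) of what mixing a decoded character string with raw UTF-8 bytes does in Perl, and the fix: decode everything to character strings before concatenating.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A decoded Perl character string (U+26C4, a snowman):
my $chars = "Hello\x{26c4}";

# The same kind of data as raw UTF-8 bytes, e.g. straight off a socket:
my $bytes = encode('UTF-8', "\x{26f0}");

# Wrong: concatenating a character string with a byte string treats
# each byte as its own character, producing mojibake (9 "characters").
my $broken = $chars . $bytes;

# Right: decode the bytes first so both sides are character strings
# (7 characters: "Hello" plus two symbols).
my $fixed = $chars . decode('UTF-8', $bytes);
```

The rule of thumb is the usual one: decode at the boundary where bytes enter the program, work in character strings internally, and encode only on output.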
Perhaps I am misunderstanding. Can you give a live example of a site that your tool is meant to fix?
Update: s/ASII/ASCII/;
Replies are listed 'Best First'.

- Re^2: Safe string handling by tdlewis77 (Sexton) on Aug 26, 2017 at 00:50 UTC
  - by Your Mother (Archbishop) on Aug 26, 2017 at 02:49 UTC
- Re^2: Safe string handling by tdlewis77 (Sexton) on Aug 26, 2017 at 00:44 UTC
  - by RonW (Parson) on Aug 28, 2017 at 22:08 UTC
  - by Anonymous Monk on Aug 28, 2017 at 22:22 UTC