in reply to Safe string handling
Dealing with data that comes from webpages can be really complicated. There is likely to be a combination of ASCII, UTF-8, and wide characters in the data returned.
ASCII is a subset of valid UTF-8, so you cannot have a combination of UTF-8 and ASCII in a string; you just have UTF-8. "Wide characters" is ambiguous here. It seems to mean broken or unknown bytes that are putative character data. That doesn't happen much in the wild anymore; when it does, you see pages littered with �s. So, I don't think that this situation is "likely." I can't think of the last time I saw it.
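A quick sketch of both points using the core Encode module (the byte values here are my own illustrations, not from the thread): pure ASCII decodes as UTF-8 unchanged, and a malformed byte decodes, by default, to the U+FFFD replacement character you see littering broken pages.

```perl
use strict;
use warnings;
use Encode qw(decode);

# ASCII is a subset of UTF-8, so decoding plain ASCII is a no-op:
my $ascii = decode('UTF-8', 'plain ASCII text');
# $ascii eq 'plain ASCII text'

# "\xC3" alone is a truncated UTF-8 sequence. Encode's default
# check mode substitutes the replacement character U+FFFD:
my $mangled = decode('UTF-8', "caf\xC3");
# $mangled eq "caf\x{FFFD}"  -- rendered as "caf�"
```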
This "Hello\x{26c4}".encode("utf-8","\x{26f0}")."\x{10102}\x{2fa1b}" is broken on purpose (a concatenation of a Perl character string, UTF-8 encoded bytes, and another Perl character string). This can only happen through incorrect handling of character data encodings which, I assert, is fairly uncommon on the web today.
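To make the failure mode concrete, here is a minimal sketch (my own example, not from the thread) of what mixing a decoded character string with raw UTF-8 bytes does in Perl, and the fix: decode everything to character strings before concatenating.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A decoded Perl character string (U+26C4, a snowman):
my $chars = "Hello\x{26c4}";

# The same kind of data as raw UTF-8 bytes, e.g. straight off a socket:
my $bytes = encode('UTF-8', "\x{26f0}");

# Wrong: concatenating a character string with a byte string treats
# each byte as its own character, producing mojibake (9 "characters").
my $broken = $chars . $bytes;

# Right: decode the bytes first so both sides are character strings
# (7 characters: "Hello" plus two symbols).
my $fixed = $chars . decode('UTF-8', $bytes);
```

The rule of thumb is the usual one: decode at the boundary where bytes enter the program, work in character strings internally, and encode only on output.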
Perhaps I am misunderstanding. Can you give a live example of a site that your tool is meant to fix?
Update: s/ASII/ASCII/;
Replies are listed 'Best First'.

- Re^2: Safe string handling by tdlewis77 (Sexton) on Aug 26, 2017 at 00:50 UTC
  - by Your Mother (Archbishop) on Aug 26, 2017 at 02:49 UTC
- Re^2: Safe string handling by tdlewis77 (Sexton) on Aug 26, 2017 at 00:44 UTC
  - by RonW (Parson) on Aug 28, 2017 at 22:08 UTC
  - by Anonymous Monk on Aug 28, 2017 at 22:22 UTC