"Dealing with data that comes from webpages can be really complicated. There is likely to be a combination of ASCII, UTF-8, and wide characters in the data returned."
ASCII is valid UTF-8, so you cannot have a combination of UTF-8 and ASCII in a string; you just have UTF-8. "Wide characters" is ambiguous here. It seems to mean broken or unknown bytes that are putative character data. That doesn't happen much in the wild anymore. When it does, you see pages littered with �s. So I don't think that this situation is "likely." I can't think of the last time I saw it.
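To make both points concrete, here is a minimal sketch (assuming only the core Encode module): strict decoding of pure ASCII bytes as UTF-8 succeeds, because ASCII is a strict subset of UTF-8, while genuinely broken bytes come out as U+FFFD replacement characters under the default lax mode:

    use strict;
    use warnings;
    use Encode qw(decode);

    # Pure ASCII bytes are already valid UTF-8, so strict decoding succeeds.
    my $ascii = decode("UTF-8", "plain ASCII bytes", Encode::FB_CROAK);

    # Genuinely broken bytes (here, a lead byte with no continuation byte)
    # decode to U+FFFD replacement characters under the default lax mode --
    # the litter you see on mangled pages.
    my $mangled = decode("UTF-8", "abc\xC3xyz");
    print $mangled =~ /\x{FFFD}/ ? "has replacement chars\n" : "clean\n";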
This "Hello\x{26c4}".encode("utf-8","\x{26f0}")."\x{10102}\x{2fa1b}" is broken on purpose (concatted perl UTF-8, binary UTF-8, and perl UTF-8). This can only happen through incorrect handling of character data encodings which, I assert, is fairly uncommon on the web today.
Perhaps I am misunderstanding. Can you give a live example of a site that your tool is meant to fix?
Update: s/ASII/ASCII/;
In reply to Re: Safe string handling by Your Mother, in thread Safe string handling by tdlewis77