I am hoping for an automated way to normalize incoming text.
Well, my suggestions for guessing the encoding still apply, and looking at the meta tags in the HTML might help too (with the same caveat that the declared encoding might be wrong). But again, for specific help with the issue you described in the root node, you'll have to show us some debug output.
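To make the idea a bit more concrete, here is a rough sketch of that kind of guessing, written in Python only for illustration. The helper name `guess_decode`, the regex, the order of the attempts, and the `cp1252` fallback are all my assumptions, not something taken from this thread: it tries strict UTF-8 first, then whatever charset the meta tag declares (which may be wrong, so that is decoded strictly as well), and finally a single-byte fallback.

```python
import re

# Hypothetical helper: the names, the regex, and the fallback order are
# assumptions made for this sketch, not an established recipe.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)

def guess_decode(raw: bytes, fallback: str = "cp1252"):
    """Return (text, encoding_used) for a blob of HTML bytes."""
    # 1. Strict UTF-8 first: bytes in a legacy encoding rarely happen
    #    to form valid UTF-8, so a clean decode is a strong hint.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 2. Charset declared in a meta tag near the top of the document.
    #    The declaration may be wrong, so decode strictly and move on
    #    if the name is unknown or the bytes do not fit.
    match = META_CHARSET.search(raw[:4096])
    if match:
        name = match.group(1).decode("ascii")
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            pass

    # 3. Last resort: a single-byte encoding, replacing undefined bytes.
    return raw.decode(fallback, errors="replace"), fallback

if __name__ == "__main__":
    sample = b'<meta charset="iso-8859-1"><p>caf\xe9</p>'
    text, used = guess_decode(sample)
    print(used, repr(text))
```

Returning the encoding that was actually used, alongside the text, gives you exactly the kind of debug output that makes a wrong guess visible.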
In reply to Re^5: Safely removing Unicode zero-width spaces and other non-printing characters
by haukex
in thread Safely removing Unicode zero-width spaces and other non-printing characters
by mldvx4