in reply to stripping characters from html
I had this problem when turning TAP output containing Unicode into JUnit XML. The solution was to translate every character outside the printable ASCII range (except newline and carriage return) into an &#xx; numeric character reference, and to do the same for any embedded ", &, <, and > characters. The XML parser was then happy with that character set.
This is useful if you can't trust the encoding of the input to be right, since ASCII is kind of the "least common denominator" when it comes to character sets.
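For illustration, here is a minimal sketch of that translation in Python. It is not the code I actually used; the function name `to_ascii_xml` is made up for this example, and it assumes the input has already been decoded into a string of characters.

```python
def to_ascii_xml(text: str) -> str:
    """Return text as printable ASCII, with everything else as &#NNN; references.

    Hypothetical helper for illustration only.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in "\n\r":
            # Newline and carriage return pass through untouched.
            out.append(ch)
        elif ch in '"&<>' or cp < 0x20 or cp > 0x7E:
            # XML-special characters and anything outside printable ASCII
            # become decimal numeric character references.
            out.append(f"&#{cp};")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii_xml('not ok 1 - caf\u00e9 <tag> & "quotes"'))
```

One caveat: XML 1.0 does not allow most control characters (for example NUL) even when written as a character reference, so a stricter version would probably drop or replace those instead of escaping them.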