You're receiving your input as cp1252 ("’" is byte 0x92 in cp1252), but you're outputting that byte as-is in a document you claim is UTF-8.
Always decode your inputs. Always encode your outputs. You are apparently doing neither.
Note that the quote is character U+2019, so the proper escape is &#x2019; or &#8217;, not an escape built from the cp1252 byte value (such as &#x92;).
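To make the decode/encode point concrete, here is a minimal sketch using the core Encode module. The sample byte string is made up for illustration; it stands in for whatever your cp1252 source actually sends.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical input: the bytes of "it’s" as a cp1252 source would send them.
my $bytes = "it\x92s";

# Decode on input: bytes become characters.
my $text = decode('cp1252', $bytes);

# The quote is now one character whose code point is U+2019, not 0x92.
my $code_point = sprintf("U+%04X", ord(substr($text, 2, 1)));
print "$code_point\n";    # U+2019

# Encode on output: characters become UTF-8 bytes.
my $utf8 = encode('UTF-8', $text);
my $hex  = join(' ', map { sprintf('%02X', ord) } split(//, $utf8));
print "$hex\n";           # 69 74 E2 80 99 73
```

The single cp1252 byte 92 turns into the three UTF-8 bytes E2 80 99, which is why passing it through untouched yields a document that isn't valid UTF-8.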
If you pass properly decoded text to the following function, it will produce XML text and XML attribute values that are 7-bit clean (plain US-ASCII, which is also valid UTF-8).
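sub encode_entities {
   my ($self, $text) = @_;
   # Escape "&" first so the entities added below aren't escaped again.
   $text =~ s/&/&amp;/g;
   $text =~ s/</&lt;/g;
   $text =~ s/>/&gt;/g;
   $text =~ s/"/&quot;/g;
   $text =~ s/'/&apos;/g;
   # Replace everything outside printable ASCII with a numeric character reference.
   $text =~ s/([^\x20-\x7E])/sprintf("&#x%X;", ord($1))/eg;
   return $text;
}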
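For what it's worth, here is a rough end-to-end sketch of how that function might be used. The package name My::Escaper and the sample input are hypothetical, chosen only so the sub can be called as a method (it expects an invocant as its first argument); the sub itself is repeated so the script runs on its own.

```perl
use strict;
use warnings;
use Encode qw( decode );

{
   # Hypothetical package; any class that holds the sub will do.
   package My::Escaper;

   sub encode_entities {
      my ($self, $text) = @_;
      $text =~ s/&/&amp;/g;
      $text =~ s/</&lt;/g;
      $text =~ s/>/&gt;/g;
      $text =~ s/"/&quot;/g;
      $text =~ s/'/&apos;/g;
      $text =~ s/([^\x20-\x7E])/sprintf("&#x%X;", ord($1))/eg;
      return $text;
   }
}

# Made-up cp1252 input containing the troublesome 0x92 byte.
my $bytes = "R&D isn\x92t <done>";

# Decode first, then escape the decoded text.
my $text = decode('cp1252', $bytes);
my $xml  = My::Escaper->encode_entities($text);

print "$xml\n";    # R&amp;D isn&#x2019;t &lt;done&gt;
```

Because the result contains only ASCII, it reads the same whether the document is declared as UTF-8, cp1252 or US-ASCII.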
In reply to "Re: Cleaning up non 7-bit Ascii Chars for XML-processing" by ikegami, in thread "Cleaning up non 7-bit Ascii Chars for XML-processing" by liverpole