The XML is always output as "UTF-8"
No, it isn't.
"’" is "E2 80 99" in UTF-8.
"’" is "92" in cp1252.
You've indicated that your data contains the latter (the single byte 92).
You've indicated that the document claims, implicitly, to be the former (UTF-8).
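The byte values quoted above can be checked with the core Encode module. This is only a small sketch; the helper name hex_bytes is made up for illustration and is not part of the original code.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: encode a character string with the named
# encoding and return the resulting bytes as space-separated hex.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode($encoding, $chars);
}

my $quote = "\x{2019}";    # RIGHT SINGLE QUOTATION MARK

print hex_bytes('UTF-8',  $quote), "\n";    # E2 80 99
print hex_bytes('cp1252', $quote), "\n";    # 92
```

A lone 92 byte is not a valid UTF-8 sequence, which is why a parser that was told (or assumes) UTF-8 rejects it.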
You can either fix the encoding of the data, or fix what the XML says the encoding is. Fixing the data is easier.
    use strict;
    use warnings;

    use Encode qw( encode decode );

    # Escape the characters that are special in XML.
    sub fix_broken_text {
        my ($self, $field) = @_;
        $field =~ s/&/&amp;/g;    # Must come first.
        $field =~ s/</&lt;/g;
        $field =~ s/>/&gt;/g;
        $field =~ s/"/&quot;/g;
        $field =~ s/'/&apos;/g;
        return $field;
    }

    # Read the whole file as bytes, then decode it from cp1252.
    my $decoded_xml;
    {
        open(my $fh, '<', $xml_qfn) or die "Can't open $xml_qfn: $!";
        binmode($fh);
        local $/;
        $decoded_xml = decode('cp1252', scalar(<$fh>));
    }

    # ...Try to fix problems with unescaped characters...

    my $encoded_xml = encode('UTF-8', $decoded_xml);

    # ...Pass $encoded_xml to parser...
If only parts of the document are cp1252, leave the file as raw bytes and convert just those parts as you fix them:
    use strict;
    use warnings;

    use Encode qw( encode decode );

    # Decode the cp1252 field, escape it, and return it as UTF-8 bytes.
    sub fix_broken_text {
        my ($self, $field) = @_;
        $field = decode('cp1252', $field);
        $field =~ s/&/&amp;/g;    # Must come first.
        $field =~ s/</&lt;/g;
        $field =~ s/>/&gt;/g;
        $field =~ s/"/&quot;/g;
        $field =~ s/'/&apos;/g;
        $field = encode('UTF-8', $field);
        return $field;
    }

    # Read the whole file as bytes, without decoding it.
    my $encoded_xml;
    {
        open(my $fh, '<', $xml_qfn) or die "Can't open $xml_qfn: $!";
        binmode($fh);
        local $/;
        $encoded_xml = <$fh>;
    }

    # ...Try to fix problems with unescaped characters...

    # ...Pass $encoded_xml to parser...
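If it isn't known in advance which parts are cp1252, one common heuristic is to try a strict UTF-8 decode first and fall back to cp1252 only when that fails. The sketch below is an assumption about how that could look, not something from the original code; the name to_utf8 is hypothetical. Note that a cp1252 string can occasionally happen to be valid UTF-8 too, so this is a guess rather than a guarantee.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: return UTF-8 bytes for a field that is
# either already UTF-8 or is cp1252.
sub to_utf8 {
    my ($bytes) = @_;

    # FB_CROAK makes decode() die on malformed UTF-8;
    # LEAVE_SRC stops it from modifying $bytes in place.
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };

    # Not valid UTF-8, so assume cp1252.
    $chars = decode('cp1252', $bytes) if !defined $chars;

    return encode('UTF-8', $chars);
}

my $fixed = to_utf8("it\x92s");    # cp1252 input, UTF-8 output
```

Input that is already valid UTF-8 passes through unchanged, so the function can be applied to every field without double-encoding the good ones.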
In reply to Re^4: Cleaning up non 7-bit Ascii Chars for XML-processing
by ikegami
in thread Cleaning up non 7-bit Ascii Chars for XML-processing
by liverpole