
I am processing some XHTML pages (using XML::Twig) that contain numerous character entities, such as:
&#195;&#169;
Sorry to be dumb here, but where am I treating it as Latin-1? How do I change it?

The problem is in your original XHTML file, because it contains the literal 12‑byte ASCII sequence &#195;&#169; (two entities) where a single é is meant.

So something somewhere somewhen took a UTF‑8 file and, instead of replacing each complete multibyte character with its single entity, replaced each individual component byte with an entity numbered by that byte's Latin‑1 code point.
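To make the byte-versus-character distinction concrete, here is a minimal sketch (the variable names are mine, not from your code) that builds both forms of the entity for é:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: U+00E9, LATIN SMALL LETTER E WITH ACUTE.
my $char = "\x{E9}";

# Its UTF-8 encoding is two bytes, 0xC3 0xA9 (195 and 169 in decimal).
my $bytes = encode('UTF-8', $char);

# Right: one entity for the whole character.
my $per_char = join '', map { '&#' . ord($_) . ';' } split //, $char;

# Mangled: one entity for each encoded byte.
my $per_byte = join '', map { '&#' . ord($_) . ';' } split //, $bytes;

print "per character: $per_char\n";   # &#233;
print "per byte:      $per_byte\n";   # &#195;&#169;
```

The second form is what ended up in your file.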

This may have happened because some program read an undecoded binary byte stream and never decoded it before trying to convert non‑ASCII into numeric entities. For example, here I use -CS in the first process to say it’s UTF‑8 but then lie to the second one by using -C0 to say that it isn’t. That would produce the sort of thing that you saw:

$ perl -CS -le 'print "na\x{EF}vet\x{E9}"'
naïveté

$ perl -CS -le 'print "na\x{EF}vet\x{E9}"' |
  perl -C0 -pe 's/(\P{ASCII})/"&#".ord($1).";"/ge'
na&#195;&#175;vet&#195;&#169;

$ perl -CS -le 'print "na\x{EF}vet\x{E9}"' |
  perl -C0 -pe 's/(\P{ASCII})/sprintf "&#x%02X;", ord($1)/ge'
na&#xC3;&#xAF;vet&#xC3;&#xA9;

Compare with the right answers:

$  perl -CS -le 'print "na\x{EF}vet\x{E9}"' | 
   perl -CS -pe 's/(\P{ASCII})/"&#".ord($1).";"/ge'
na&#239;vet&#233;

$ perl -CS -le 'print "na\x{EF}vet\x{E9}"' | 
  perl -CS -pe 's/(\P{ASCII})/sprintf "&#x%02X;", ord($1)/ge'
na&#xEF;vet&#xE9;

So what you really need to do is track down whatever errant procedure caused this mess in the first place, and fix that, since it will never work right that way.
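As a rough sketch of what the repaired procedure might look like in script form (`entitize_utf8` is a made-up name, and I am assuming the procedure receives raw UTF‑8 bytes), the point is simply to decode before calling ord():

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: raw UTF-8 bytes in, ASCII text with numeric entities out.
sub entitize_utf8 {
    my ($bytes) = @_;

    # Decode first, so each multibyte sequence becomes one character.
    my $text = decode('UTF-8', $bytes);

    # Now ord() sees code points rather than individual bytes.
    $text =~ s/(\P{ASCII})/'&#' . ord($1) . ';'/ge;
    return $text;
}

print entitize_utf8("na\xC3\xAFvet\xC3\xA9"), "\n";   # na&#239;vet&#233;
```

Skip the decode step and you get the per-byte entities again.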

This demonstrates putting it back to UTF-8:

$ perl -CS -le 'print "na\x{EF}vet\x{E9}"' | 
  perl -C0 -pe 's/(\P{ASCII})/"&#".ord($1).";"/ge' | 
  perl -C0 -pe 's/&#(\d+);/chr($1)/ge'
naïveté

And this, heaven help you, demonstrates doing that and then redoing the entities the way they should have been done in the first place:

$ perl -CS -le 'print "na\x{EF}vet\x{E9}"' |
  perl -C0 -pe 's/(\P{ASCII})/"&#".ord($1).";"/ge' | 
  perl -MEncode -C0 -pe 's/&#(\d+);/chr($1)/ge;
                                    $_ = decode_utf8($_, 1); 
                                    s/(\P{ASCII})/"&#".ord($1).";"/ge'
na&#239;vet&#233;

That means that if you were courageous enough, you could just do this:

$ perl -i.unmangled.by.$$ -MEncode -C0 -pe 's/&#(\d+);/chr($1)/ge; $_ = decode_utf8($_, 1); s/(\P{ASCII})/"&#".ord($1).";"/ge' all*your*broken*files.xhtml

Here’s a version that runs as a script instead of from the command line:

#!/usr/bin/env perl
use strict;
use warnings;
use Encode;
die "gimme args" unless @ARGV;
$^I = ".unmangled.by.$$";
while (<>) {
    s/&#(\d+);/chr($1)/ge;
    $_ = decode_utf8($_, 1);
    s/(\P{ASCII})/ "&#" . ord($1) . ";" /ge;
    print;
}

No warranties, though. Make sure you thoroughly understand all this before you further mangle your files.
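If the script above feels too blunt, a more cautious variant is sketched below. It is only a heuristic under my own assumptions: it touches nothing but runs of two or more consecutive decimal entities whose values all lie in 128..255 and which decode as well-formed UTF‑8, and it leaves everything else alone. The names `fix_run` and `unmangle_line` are invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: repair one run of adjacent decimal entities, if it looks mangled.
sub fix_run {
    my ($run) = @_;
    my @codes = $run =~ /&#(\d+);/g;

    # A mangled run consists solely of non-ASCII byte values.
    return $run if grep { $_ < 128 || $_ > 255 } @codes;

    # Reassemble the bytes; FB_CROAK makes decode() die on malformed UTF-8.
    my $bytes = pack 'C*', @codes;
    my $chars = eval { Encode::decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return $run unless defined $chars;

    # Emit one entity per decoded character.
    return join '', map { '&#' . ord($_) . ';' } split //, $chars;
}

# Hypothetical helper: apply fix_run to every run of two or more entities in a line.
sub unmangle_line {
    my ($line) = @_;
    $line =~ s/((?:&#\d+;){2,})/fix_run($1)/ge;
    return $line;
}

print unmangle_line("na&#195;&#175;vet&#195;&#169;"), "\n";   # na&#239;vet&#233;
print unmangle_line("caf&#233; costs &#8364;2"), "\n";        # unchanged
```

It will still guess wrong on a file that really did mean Ã followed by ©, so the same caution applies.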

In reply to Re^9: Encoding/decoding question by tchrist
in thread Encoding/decoding question by slugger415
