in reply to Re: XML Simple Charset Q?
in thread XML Simple Charset Q?

The problem here is that I'm trying to process many such snippets of XML for output as HTML. I suspect it's easier in this case to do a substitution regex with the /e modifier instead of going through all the munging back from UTF-8.

s/([\x80-\xff])/'&#'.ord($1).';'/eg
appears to work for all the characters I care about.
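
For anyone who wants to try this outside a one-liner, here is a minimal sketch of the same substitution wrapped in a sub. The name latin1_to_entities is made up for illustration, and the input is assumed to be a string of Latin-1 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is an assumption, not from the thread).
# Expects a string of Latin-1 bytes.
sub latin1_to_entities {
    my ($text) = @_;
    # Bytes 0x80-0xFF carry the same number as their Unicode code point,
    # so ord() yields the value needed for the numeric character reference.
    $text =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/eg;
    return $text;
}

my $in  = "caf\xe9 \xa9 2002";       # "cafe(acute) (c) 2002" as Latin-1 bytes
my $out = latin1_to_entities($in);
print "$out\n";                       # prints: caf&#233; &#169; 2002
```

Plain ASCII passes through untouched, so the sub should be safe to run over whole snippets.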

Update: XML::Parser still insists on converting &#NNN; to UTF-8! I didn't notice because Mozilla cunningly detected the changed page encoding and automagically displayed the page as UTF-8. Mutter Mutter Curse Curse - this is a major pain, as I'd like the page to remain Latin-1.

Dingus


Enter any 47-digit prime number to continue.

Re: Re: Re: XML Simple Charset Q?
by John M. Dlugosz (Monsignor) on Nov 25, 2002 at 19:52 UTC
    Since the codes for Latin-1 are the same as Unicode for the first 256 values, that should work (you need to re-encode the values but don't need to translate them through a table). That is, if "use utf8" is not in scope when the regex is compiled. I don't know about Perl 5.8, which reportedly doesn't need the utf8 pragma; you might need some other way to refer to those characters on the input.

    Anyway, you can use the same lightweight trick to convert back: s/([\x{80}-\x{ff}])/pack('C',ord($1))/eg, compiled with utf8 in effect. (Note the curlies on the \x codes; they indicate UTF-8 encoded characters.) Use pack instead of chr so you can specify bytes: chr does too much DWIMmery, and the persuasion thing is not as transparent as one would hope when dealing with I/O, though I think its behavior in 5.6 would work in this case.
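
    The pack trick above is tied to how 5.6 handles the utf8 pragma. As a rough sketch of the same round trip on Perl 5.8 or later, the core Encode module can do the conversion explicitly. This is a different technique from the regex, not the one described in this post, and the name utf8_to_latin1 is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (name is an assumption): UTF-8 octets in, Latin-1 octets out.
sub utf8_to_latin1 {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);    # octets -> Perl characters
    # Characters above U+00FF have no Latin-1 byte; with the default
    # CHECK value Encode replaces them with a substitution character.
    return encode('ISO-8859-1', $chars);     # characters -> single bytes
}

my $utf8   = "caf\xc3\xa9";                  # "cafe(acute)" as UTF-8 octets
my $latin1 = utf8_to_latin1($utf8);
printf "%d bytes -> %d bytes\n", length($utf8), length($latin1);   # 5 bytes -> 4 bytes
```

    The two-byte sequence C3 A9 comes out as the single byte E9, which is what a Latin-1 page needs.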

    —John

      Er, it's important to note that although the ISO-8859-1 code points are the same as Unicode (below 256), the encodings are not the same (values above 127 are encoded as multiple bytes in UTF-8, but values in ISO-8859-1 are always encoded as a single byte).

      I bet most people in this conversation know this, but it's a bit important to clarify. For me, it wasn't so long ago that I didn't know the difference between codepoint and encoding.
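
      To make the code point versus encoding distinction concrete, here is a small sketch that encodes one code point both ways and dumps the bytes. It assumes Perl 5.8 or later with the core Encode module; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char   = chr(0xE9);                      # code point U+00E9
my $latin1 = encode('ISO-8859-1', $char);    # encoded as one byte
my $utf8   = encode('UTF-8', $char);         # encoded as two bytes

printf "code point: U+%04X\n", ord($char);           # U+00E9
printf "ISO-8859-1: %s\n", unpack('H*', $latin1);    # e9
printf "UTF-8:      %s\n", unpack('H*', $utf8);      # c3a9
```

      Same number in both cases, but a different byte sequence on the wire.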

        Right (I updated my post to clarify). His regex takes single-byte characters in the range 80-ff and recodes them as HTML escape codes. Same number, just a different way of persisting it to the output stream.

        Inspired by that, I showed that the same idea can convert from UTF-8 by using the utf8 pragma and the extended \x escape codes in the regex, while encoding to Latin-1 by using pack.