in reply to Re: Re: XML Simple Charset Q?
in thread XML Simple Charset Q?

Since the codes for Latin-1 are the same as Unicode for the first 256 values, that should work (you need to re-encode the values but don't need to translate them through a table). That is, if "use utf8" is not in scope when the regex is compiled. I don't know about Perl 5.8, which reportedly doesn't need the utf8 pragma; you might need some other way to refer to those characters on the input.

Anyway, you can use the same lightweight trick to convert back: s/([\x{80}-\x{ff}])/pack('C',ord($1))/eg compiled with utf8 in effect (note the curlies on the \x codes; they indicate UTF-8 encoded characters). Use pack instead of chr so you can specify bytes (chr does too much DWIMery, and the persuasion thing is not as transparent as one would hope when dealing with I/O, though I think its behavior in 5.6 would work in this case).
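For anyone on a newer Perl, here is a rough sketch of the same UTF-8 to Latin-1 idea. It assumes Perl 5.8 or later with the core Encode module, which is not what the 5.6 trick above relies on, and utf8_to_latin1 is a made-up helper name. Note that pack('C', ...) wants a number, so the matched character goes through ord first.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): UTF-8 octets in, Latin-1 octets out.
sub utf8_to_latin1 {
    my ($octets) = @_;

    # Turn the UTF-8 octets into a string of characters (code points).
    my $chars = decode('UTF-8', $octets);

    my $out = '';
    for my $ch (split //, $chars) {
        my $cp = ord $ch;

        # Latin-1 only covers code points 0x00-0xff.
        die sprintf("U+%04X has no Latin-1 equivalent\n", $cp) if $cp > 0xff;

        # Same number, written out as exactly one byte.
        $out .= pack('C', $cp);
    }
    return $out;
}

my $latin1 = utf8_to_latin1("caf\xc3\xa9");   # "cafe" with e-acute, as UTF-8 octets
printf "%d bytes\n", length $latin1;
```

In practice encode('ISO-8859-1', $chars) from the same module should do this in one call; the loop is only there to keep the "same number, different bytes" step visible.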

—John

Re: Re: Re: Re: XML Simple Charset Q?
by jkahn (Friar) on Nov 25, 2002 at 20:03 UTC
    Er, it's important to note that though the ISO-8859-1 codepoints are the same as Unicode (below 256), the encodings are not the same (values above 127 are encoded multi-byte in UTF-8, but values in ISO-8859-1 are always encoded single-byte).
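A small sketch of that difference, assuming Perl 5.8 or later and the core Encode module (hex_bytes is a hypothetical helper added only for display): one code point, U+00E9, written out in both encodings.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show each byte of a string as two hex digits.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $char = chr(0xE9);                       # U+00E9, e with acute accent

my $latin1 = encode('ISO-8859-1', $char);   # always one byte per character
my $utf8   = encode('UTF-8', $char);        # multi-byte above 127

printf "code point: U+%04X\n", ord $char;
printf "ISO-8859-1: %s\n", hex_bytes($latin1);   # e9
printf "UTF-8:      %s\n", hex_bytes($utf8);     # c3 a9
```

The code point is 0xE9 in both cases; only the bytes on the wire differ.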

    I bet most people in this conversation know this, but it's a bit important to clarify. For me, it wasn't so long ago that I didn't know the difference between codepoint and encoding.

    Links to that subject:

      Right (I updated my post to clarify). His regex takes single-byte characters in the range 0x80-0xff and recodes them as HTML escape codes. Same number, just a different way of persisting it to the output stream.
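The original regex is not quoted in this part of the thread, so the following is only a guess at its shape: a sketch that rewrites each high Latin-1 byte as a decimal HTML numeric character reference. The name latin1_to_html is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical reconstruction: Latin-1 bytes in, ASCII with &#NNN; escapes out.
sub latin1_to_html {
    my ($bytes) = @_;

    # Work on a copy; each byte 0x80-0xff becomes &#<decimal code point>;
    (my $html = $bytes) =~ s/([\x80-\xff])/'&#' . ord($1) . ';'/eg;

    return $html;
}

print latin1_to_html("caf\xe9"), "\n";   # caf&#233;
```

Because the Latin-1 byte value and the Unicode code point are the same number below 256, no lookup table is needed; ord($1) is already the right value for the escape.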

      Inspired by that, I showed that the same idea can convert from UTF-8 by using the utf8 pragma and the extended \x escape codes in the regex, and meanwhile encode to Latin-1 by using pack.