Latin-1 (ISO-8859-1) is a subset of Unicode. UTF-8 is an algorithmic transform of Unicode, which maps characters > 127 to multiple bytes. See RFC 2279 or the Unicode site for details.
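
To see the multi-byte mapping for yourself, here is a small sketch. It assumes the Encode module is available (it ships with Perl 5.8 and later); the helper name utf8_hex is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the UTF-8 bytes of one code point as space-separated hex.
# utf8_hex is a hypothetical helper name, not part of any module.
sub utf8_hex {
    my ($cp) = @_;
    my $bytes = encode('UTF-8', chr($cp));   # character -> UTF-8 bytes
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# An ASCII letter, a Latin-1 letter, and two characters beyond Latin-1.
for my $cp (0x41, 0xE9, 0x201C, 0xFEFF) {
    printf "U+%04X -> %s\n", $cp, utf8_hex($cp);
}
# U+0041 -> 41
# U+00E9 -> C3 A9
# U+201C -> E2 80 9C
# U+FEFF -> EF BB BF
```

Only the ASCII character stays a single byte; everything above 127 becomes two or three bytes.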

If you know that your characters are all from the Latin-1 character set (but in the UTF-8 encoding), you can just do this:

    pack "C*", unpack "U*", $_

This maps directly to Latin-1, because the first 256 Unicode code points are exactly the Latin-1 characters. But for other character sets, you'll need table-driven mappings. There are modules that do this. See Unicode::Map and similar modules on CPAN.
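
As one example of the table-driven route, here is a sketch using the core Encode module rather than Unicode::Map. That is my substitution, assuming Perl 5.8 or later; utf8_to_cp1252 is a made-up name, and the input is assumed to be a string of raw UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a UTF-8 byte string to Windows-1252 bytes.
# With Encode's default settings, characters that have no Windows-1252
# equivalent are replaced by a substitution character ('?').
# utf8_to_cp1252 is a hypothetical helper name.
sub utf8_to_cp1252 {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);   # bytes -> Perl characters
    return encode('cp1252', $chars);            # characters -> cp1252 bytes
}

# Curly quotes around "test", then "Jos" plus e-acute, all as UTF-8 bytes.
my $in  = "\xE2\x80\x9Ctest\xE2\x80\x9D Jos\xC3\xA9";
my $out = utf8_to_cp1252($in);
print unpack("H*", $out), "\n";   # 937465737494204a6f73e9
```

The curly quotes come out as 0x93 and 0x94, which is what the hand-built table further down produces as well.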

Here is a quickie which just handles Windows-1252:

    #!perl -w
    use strict;

    my %unicode2win1252 = (
        0x0152 => 0x8C, 0x0153 => 0x9C, 0x0160 => 0x8A, 0x0161 => 0x9A,
        0x0178 => 0x9F, 0x017D => 0x8E, 0x017E => 0x9E, 0x0192 => 0x83,
        0x02C6 => 0x88, 0x02DC => 0x98, 0x2013 => 0x96, 0x2014 => 0x97,
        0x2018 => 0x91, 0x2019 => 0x92, 0x201A => 0x82, 0x201C => 0x93,
        0x201D => 0x94, 0x201E => 0x84, 0x2020 => 0x86, 0x2021 => 0x87,
        0x2022 => 0x95, 0x2026 => 0x85, 0x2030 => 0x89, 0x2039 => 0x8B,
        0x203A => 0x9B, 0x20AC => 0x80, 0x2122 => 0x99,
    );

    # Map each code point through the table; unmapped code points pass through.
    sub simplemap {
        my ($map, $str) = @_;
        pack "C*", map { $$map{$_} || $_ } unpack "U*", $str
    }

    my $a = "This is a " . pack("U*", 0x201c) . "test" . pack("U*", 0x201d)
          . " Okay, Jos" . pack("U*", 0xe9) . "?" . pack("U*", 0xfeff);
    # The last character U+FEFF is not in Windows-1252 and is thrown in
    # as an example of what happens to other characters.
    my $b = simplemap(\%unicode2win1252, $a);
    my $c = unpack("H*", $b);
    print "a = $a\nb = $b\nc = $c\n";

There are C and Java conversion routines at the ICU project. I derived the hash %unicode2win1252 from the data file data/ibm-5348.ucm. See data/convrtrs.txt for the names of the character sets.
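
If you would rather get a visible placeholder for characters that are not in the table, something along these lines might do. It is only a sketch: simplemap_with_fallback is a made-up name, the '?' placeholder is my choice, and %sample_map is an abbreviated excerpt of the full table, used just to keep the example short.

```perl
use strict;
use warnings;

# Abbreviated excerpt of the Unicode => Windows-1252 table (illustration only).
my %sample_map = (0x201C => 0x93, 0x201D => 0x94, 0x20AC => 0x80);

# Like simplemap, but code points above 0xFF with no table entry
# become '?' instead of being handed to pack "C*" as-is.
# simplemap_with_fallback is a hypothetical name.
sub simplemap_with_fallback {
    my ($map, $str) = @_;
    my @out;
    for my $cp (unpack "U*", $str) {
        if    (exists $map->{$cp}) { push @out, $map->{$cp} }   # table hit
        elsif ($cp <= 0xFF)        { push @out, $cp }           # pass through
        else                       { push @out, ord('?') }      # placeholder
    }
    return pack "C*", @out;
}

my $str = pack("U*", 0x201C, ord('a'), 0xE9, 0xFEFF);
print unpack("H*", simplemap_with_fallback(\%sample_map, $str)), "\n";   # 9361e93f
```

Pass \%unicode2win1252 instead of the excerpt to cover the whole table.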

In reply to Re: regex for utf-8 by Thelonius
in thread regex for utf-8 by jjohhn
