Latin-1 (ISO-8859-1) is a subset of Unicode: its 256 characters are the code points U+0000 through U+00FF. UTF-8 is an algorithmic transform of Unicode, which maps characters > 127 to multiple bytes. See RFC 2279 (since superseded by RFC 3629) or the Unicode site for details.
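To make the multi-byte mapping concrete, here is a minimal sketch that prints the UTF-8 bytes for three code points. It uses the core Encode module, which is my choice for the illustration and not something the rest of this post depends on; the helper name utf8_hex is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the UTF-8 encoding of one code point.
sub utf8_hex {
    my ($codepoint) = @_;
    return unpack "H*", encode("UTF-8", chr $codepoint);
}

# An ASCII letter stays one byte; e-acute takes two; a curly quote takes three.
printf "U+%04X -> %s\n", $_, utf8_hex($_) for 0x41, 0xE9, 0x201C;
# U+0041 -> 41
# U+00E9 -> c3a9
# U+201C -> e2809c
```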

If you know that your characters are all from the Latin-1 character set (but in the UTF-8 encoding), you can just do this:

pack "C*", unpack "U*", $_

This maps directly to Latin-1, because each Latin-1 byte value is the same number as the character's Unicode code point. But for other character sets, you'll need table-driven mappings. There are modules that do this. See the Unicode::Map and similar modules on CPAN.
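As one possible table-driven route, the core Encode module (bundled with Perl since 5.8) ships mappings for both iso-8859-1 and cp1252. It is a swapped-in alternative, not the Unicode::Map module named above, and the helper utf8_to_charset is a hypothetical name. The sketch also makes the bytes-versus-characters step explicit, which seems worth doing because, as far as I know, the way unpack "U*" treats a raw byte string changed in Perl 5.10.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode UTF-8 bytes into a single-byte charset.
# $charset is any name Encode knows, e.g. "iso-8859-1" or "cp1252".
sub utf8_to_charset {
    my ($charset, $utf8_bytes) = @_;
    my $chars = decode("UTF-8", $utf8_bytes);   # bytes -> characters
    return encode($charset, $chars);            # characters -> bytes
}

# Raw UTF-8 bytes for: Jos<e-acute> <left quote>test<right quote>
my $utf8 = "Jos\xC3\xA9 \xE2\x80\x9Ctest\xE2\x80\x9D";
print unpack("H*", utf8_to_charset("cp1252", $utf8)), "\n";
```

With the default settings, Encode substitutes a replacement character for anything the target charset cannot represent; pass a CHECK argument such as Encode::FB_CROAK if you would rather have it die.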

Here is a quickie which just handles Windows-1252:

#!perl -w
use strict;
my %unicode2win1252 = (
    0x0152 => 0x8C, 0x0153 => 0x9C, 0x0160 => 0x8A, 0x0161 => 0x9A,
    0x0178 => 0x9F, 0x017D => 0x8E, 0x017E => 0x9E, 0x0192 => 0x83,
    0x02C6 => 0x88, 0x02DC => 0x98, 0x2013 => 0x96, 0x2014 => 0x97,
    0x2018 => 0x91, 0x2019 => 0x92, 0x201A => 0x82, 0x201C => 0x93,
    0x201D => 0x94, 0x201E => 0x84, 0x2020 => 0x86, 0x2021 => 0x87,
    0x2022 => 0x95, 0x2026 => 0x85, 0x2030 => 0x89, 0x2039 => 0x8B,
    0x203A => 0x9B, 0x20AC => 0x80, 0x2122 => 0x99,
);
sub simplemap {
  my ($map, $str) = @_;
  pack "C*", map { $$map{$_}||$_ } unpack "U*", $str
}

my $a = "This is a " . pack("U*", 0x201c) . "test" . pack("U*", 0x201d)
       . " Okay, Jos" . pack("U*", 0xe9) . "?" . pack("U*", 0xfeff);

# The last character U+FEFF is not in Windows-1252 and is thrown in
# as an example of what happens to other characters.

my $b = simplemap(\%unicode2win1252, $a);
my $c = unpack("H*", $b); 
print "a = $a\nb = $b\nc = $c\n";

There are C and Java conversion routines at the ICU project. I derived the hash %unicode2win1252 from the data file data/ibm-5348.ucm. See data/convrtrs.txt for the names of the character sets.
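One caveat about simplemap: pack "C*" only has room for values 0 to 255, so a code point like U+FEFF that is missing from the table cannot come through intact (recent Perls warn about it, as far as I can tell). If you would rather get a visible placeholder, a variant along these lines might do. The name simplemap_subst and the choice of '?' are my assumptions, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical variant of simplemap: anything that is neither in the table
# nor a single-byte code point becomes '?' rather than reaching pack "C*".
sub simplemap_subst {
    my ($map, $str) = @_;
    my @bytes;
    for my $cp (unpack "U*", $str) {
        if    (exists $map->{$cp}) { push @bytes, $map->{$cp} }
        elsif ($cp <= 0xFF)        { push @bytes, $cp }
        else                       { push @bytes, ord '?' }
    }
    return pack "C*", @bytes;
}

# Two-entry excerpt of %unicode2win1252, just for the demonstration.
my %excerpt = (0x201C => 0x93, 0x201D => 0x94);
my $in = chr(0x201C) . "test" . chr(0x201D) . chr(0xFEFF);
print unpack("H*", simplemap_subst(\%excerpt, $in)), "\n";
# 9374657374943f
```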

In reply to Re: regex for utf-8 by Thelonius
in thread regex for utf-8 by jjohhn
