If you know that your characters are all from the Latin-1 character set (but in the UTF-8 encoding), you can just do this:

    pack "C*", unpack "U*", $_

This maps directly to Latin-1. But for other character sets, you'll need table-driven mappings. There are modules that do this. See the Unicode::Map and similar modules on CPAN.
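As one example of the module route, here is a minimal sketch using the Encode module that ships with Perl 5.8 and later. It is not the approach used elsewhere in this post, just an alternative; the variable names and sample strings are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 encoded bytes for "Jos" followed by e-acute (C3 A9 in UTF-8).
my $utf8_bytes = "Jos\xC3\xA9";

# Decode the bytes into Perl characters, then re-encode as Latin-1.
my $chars  = decode('UTF-8', $utf8_bytes);
my $latin1 = encode('ISO-8859-1', $chars);
print unpack("H*", $latin1), "\n";    # prints 4a6f73e9

# The same call with a different target name uses Encode's own table
# for Windows-1252: U+201C (left double quote) becomes byte 0x93.
my $win = encode('cp1252', "\x{201C}");
print unpack("H*", $win), "\n";       # prints 93
```

By default, encode() substitutes a replacement character for anything the target character set cannot represent, so check the Encode documentation if you need different handling.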
Here is a quickie which just handles Windows-1252:

    #!perl -w
    use strict;

    my %unicode2win1252 = (
        0x0152 => 0x8C, 0x0153 => 0x9C, 0x0160 => 0x8A,
        0x0161 => 0x9A, 0x0178 => 0x9F, 0x017D => 0x8E,
        0x017E => 0x9E, 0x0192 => 0x83, 0x02C6 => 0x88,
        0x02DC => 0x98, 0x2013 => 0x96, 0x2014 => 0x97,
        0x2018 => 0x91, 0x2019 => 0x92, 0x201A => 0x82,
        0x201C => 0x93, 0x201D => 0x94, 0x201E => 0x84,
        0x2020 => 0x86, 0x2021 => 0x87, 0x2022 => 0x95,
        0x2026 => 0x85, 0x2030 => 0x89, 0x2039 => 0x8B,
        0x203A => 0x9B, 0x20AC => 0x80, 0x2122 => 0x99,
    );

    sub simplemap {
        my ($map, $str) = @_;
        pack "C*", map { $$map{$_}||$_ } unpack "U*", $str
    }

    my $a = "This is a " . pack("U*", 0x201c) . "test" .
        pack("U*", 0x201d) . " Okay, Jos" . pack("U*", 0xe9) .
        "?" . pack("U*", 0xfeff);
    # The last character U+FEFF is not in Windows-1252 and is thrown in
    # as an example of what happens to other characters.
    my $b = simplemap(\%unicode2win1252, $a);
    my $c = unpack("H*", $b);
    print "a = $a\nb = $b\nc = $c\n";

There are C and Java conversion routines at the ICU project. I derived the hash %unicode2win1252 from the data file data/ibm-5348.ucm. See data/convrtrs.txt for the names of the character sets.
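If you would rather not let unmapped characters such as U+FEFF fall through to pack "C*", one possible variation is to substitute a question mark for them. This is only a sketch: the name simplemap_safe is hypothetical, the table is a three-entry subset for illustration, and the choice of '?' as the substitute is an assumption.

```perl
use strict;
use warnings;

# Small subset of the Unicode-to-Windows-1252 table, for illustration only.
my %unicode2win1252 = (
    0x201C => 0x93,
    0x201D => 0x94,
    0x20AC => 0x80,
);

# Hypothetical variant of simplemap: a code point with no table entry
# is passed through if it fits in one byte, otherwise replaced by '?'.
sub simplemap_safe {
    my ($map, $str) = @_;
    my @bytes;
    for my $cp (unpack "U*", $str) {
        if    (exists $map->{$cp}) { push @bytes, $map->{$cp} }
        elsif ($cp <= 0xFF)        { push @bytes, $cp }
        else                       { push @bytes, 0x3F }   # '?'
    }
    return pack "C*", @bytes;
}

my $in  = pack("U*", 0x201C, 0x74, 0x201D, 0xFEFF);
my $out = simplemap_safe(\%unicode2win1252, $in);
print unpack("H*", $out), "\n";    # prints 9374943f
```

Note that code points 0x80 through 0x9F still pass through unchanged here, even though a few of those byte values are unassigned in Windows-1252.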
In reply to Re: regex for utf-8 by Thelonius, in thread regex for utf-8 by jjohhn