in reply to regex for utf-8
If you know that your characters are all from the Latin-1 character set (but in the UTF-8 encoding), you can just do this:

    pack "C*", unpack "U*", $_

This maps directly to Latin-1. But for other character sets, you'll need table-driven mappings. There are modules that do this. See the Unicode::Map and similar modules on CPAN.
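As a rough sketch of the module route, here is the same conversion done with the core Encode module (bundled with Perl since 5.8), checked against the pack/unpack one-liner. Encode is my choice for illustration and is not the Unicode::Map module named above. The helper name to_charset is hypothetical, and the input is assumed to be a decoded character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode a character string into the named
# legacy character set using Encode's built-in tables.
sub to_charset {
    my ($charset, $str) = @_;
    return encode($charset, $str);
}

my $str = "Jos" . pack("U", 0xE9);              # "Jos" followed by U+00E9

my $via_pack   = pack "C*", unpack "U*", $str;  # the one-liner from the post
my $via_encode = to_charset("iso-8859-1", $str);

print unpack("H*", $via_pack), "\n";            # prints 4a6f73e9
print unpack("H*", $via_encode), "\n";          # prints 4a6f73e9
```

For a character set other than Latin-1 you would pass a different name, for example "cp1252", provided your Encode installation knows that name.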
Here is a quickie which just handles Windows-1252:

    #!perl -w
    use strict;

    my %unicode2win1252 = (
        0x0152 => 0x8C, 0x0153 => 0x9C, 0x0160 => 0x8A, 0x0161 => 0x9A,
        0x0178 => 0x9F, 0x017D => 0x8E, 0x017E => 0x9E, 0x0192 => 0x83,
        0x02C6 => 0x88, 0x02DC => 0x98, 0x2013 => 0x96, 0x2014 => 0x97,
        0x2018 => 0x91, 0x2019 => 0x92, 0x201A => 0x82, 0x201C => 0x93,
        0x201D => 0x94, 0x201E => 0x84, 0x2020 => 0x86, 0x2021 => 0x87,
        0x2022 => 0x95, 0x2026 => 0x85, 0x2030 => 0x89, 0x2039 => 0x8B,
        0x203A => 0x9B, 0x20AC => 0x80, 0x2122 => 0x99,
    );

    sub simplemap {
        my ($map, $str) = @_;
        pack "C*", map { $$map{$_}||$_ } unpack "U*", $str
    }

    my $a = "This is a " . pack("U*", 0x201c) . "test" . pack("U*", 0x201d)
          . " Okay, Jos" . pack("U*", 0xe9) . "?" . pack("U*", 0xfeff);
    # The last character U+FEFF is not in Windows-1252 and is thrown in
    # as an example of what happens to other characters.
    my $b = simplemap(\%unicode2win1252, $a);
    my $c = unpack("H*", $b);
    print "a = $a\nb = $b\nc = $c\n";

There are C and Java conversion routines at the ICU project. I derived the hash %unicode2win1252 from the data file data/ibm-5348.ucm. See data/convrtrs.txt for the names of the character sets.
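The U+FEFF in the test string is there to show what happens to characters outside the table: simplemap hands the raw code point to pack "C", which cannot hold it. If you would rather get a visible placeholder, something like the following might do. This is only a sketch: simplemap_subst is a hypothetical name, the table is a short subset of the one above, and treating U+0080 through U+009F as unmappable is my assumption, not something taken from the ICU data.

```perl
use strict;
use warnings;

# Abbreviated subset of the table above, for illustration only.
my %unicode2win1252 = (
    0x2018 => 0x91, 0x2019 => 0x92,
    0x201C => 0x93, 0x201D => 0x94,
    0x20AC => 0x80,
);

# Hypothetical variant of simplemap: code points found in the table are
# mapped, ASCII and U+00A0..U+00FF pass through unchanged, and anything
# else becomes the substitute byte (default '?').
sub simplemap_subst {
    my ($map, $str, $subst) = @_;
    $subst = ord('?') unless defined $subst;
    my @bytes;
    for my $cp (unpack "U*", $str) {
        if (exists $map->{$cp}) {
            push @bytes, $map->{$cp};
        }
        elsif ($cp < 0x80 || ($cp >= 0xA0 && $cp <= 0xFF)) {
            push @bytes, $cp;
        }
        else {
            push @bytes, $subst;    # assumption: not representable
        }
    }
    return pack "C*", @bytes;
}

my $in  = pack("U*", 0x201C, ord('t'), 0x201D, 0xE9, 0xFEFF);
my $out = simplemap_subst(\%unicode2win1252, $in);
print unpack("H*", $out), "\n";     # prints 937494e93f
```

Using exists rather than || also keeps a table entry that maps to byte 0 from being skipped, though the Windows-1252 table has no such entry.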