kettle has asked for the wisdom of the Perl Monks concerning the following question:

I would be very grateful if someone could give me a detailed description of the following regular expression:

s/([^\0-\x7F])/do {my $o = ord($1); sprintf("%c%c", 0xc0 | ($o >> 6), 0x80 | ($o & 0x3f)) }/ge

The above is the meat of a uri_escape_utf8 subroutine. I get the gist of what is going on; we're converting a UTF-8-encoded string to its ASCII-only URI representation. Here is my understanding so far:

First, we capture any character that falls outside the range \0-\x7F, using the negated character class [^\0-\x7F] (that range being the basic ASCII character set, I think).

Next every time we grab a legal character, we store the numeric value of our character in $o, then we sprintf the conversion... But I don't know exactly what is going on in the sprintf statement...

I want to do something similar: convert all fullwidth Latin characters (0xFF01-0xFF5E) to their more common ASCII representations (0x0021-0x007E). Any lucid explanation of the above code, or alternatively of how to directly solve my problem, will be greatly appreciated!!

P.S. I know I can do this with a hash of corresponding values, I'm hoping there is something shorter and more interesting.

Replies are listed 'Best First'.
Re: Intra-Unicode Conversions
by jonadab (Parson) on Nov 15, 2006 at 14:11 UTC

    The | binds more tightly than the comma, so the sprintf can be written, for clarity, like this:

    sprintf("%c%c", (0xc0 | ($o >> 6)), (0x80 | ($o & 0x3f)) )

    Wow, talk about bit-fiddling. For the meanings of the specific operators, see perldoc perlop, but in short we're making the first byte by shifting over the bits in the original character and then flipping on certain bits, resulting in the high bits of the original character being the low bits of our first byte. Then we're making the second byte by turning off certain bits in the original value (chiefly, the high bits, which are already represented in the first byte) and turning others on. I think.

    I don't know enough about unicode to explain the reasons behind the particulars, in terms of which bits end up where. If I were trying to figure it out, I'd draw myself a little diagram...
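As a sketch of such a diagram (not part of the original reply, just a worked example for U+00A4), the two OR operations lay the bits out like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $o  = 0xA4;                 # sample character: U+00A4, binary 10100100
my $b1 = 0xc0 | ($o >> 6);     # 110xxxxx: the character's high bits, shifted down
my $b2 = 0x80 | ($o & 0x3f);   # 10xxxxxx: the character's low six bits
printf "%08b %08b -> 0x%02X 0x%02X\n", $b1, $b2, $b1, $b2;
# prints "11000010 10100100 -> 0xC2 0xA4"
```

So the ">> 6" moves the high bits into the low end of the first byte, and "& 0x3f" keeps only the low six bits for the second byte, exactly as described above.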


    Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. You can just call me "Mister Sanity". Why, I've got so much sanity it's driving me crazy.
Re: Intra-Unicode Conversions
by robartes (Priest) on Nov 15, 2006 at 14:51 UTC

    What the rather convoluted regex in your post does is take a character (in the narrow definition thereof: an unsigned char, in C-speak) in the range 0x80 - 0xFF (your basic 'code page playground' of yore) and convert it to its valid UTF-8 representation. It does this in a very naive way, BTW, that is only valid if the input text is in a code page whose characters 0x80 - 0xFF correspond to Unicode code points U+0080 to U+00FF (i.e. ISO 8859-1), which is to say not many of them.

    UTF-8 says that any Unicode codepoint in the range U+0080 to U+07FF is encoded in two bytes, with the first three bits (highest order bits) of the first (highest order) byte being 110 and the first two bits of the second byte being 10. The remaining 11 bits are used to store the actual codepoint value. E.g., the character U+00A4 (the currency symbol ¤) is stored as follows:

    Codepoint U+00A4 --> hex 0xA4 --> binary 10100100
    We need to store 10100100 in the UTF-8 bytes:  110xxxxx 10xxxxxx
    We distribute 10100100 over the 'x' slots in the two bytes:  110 00010   10 100100
    So U+00A4 in UTF-8 becomes 11000010 10100100, or 0xC2 0xA4.
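That arithmetic can be checked against Perl's own encoder (a sanity check added here, not part of the original reply; Encode has been a core module since 5.8):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

my $o      = 0xA4;   # U+00A4, the currency sign
# Bytes built by hand, as in the regex under discussion
my $manual = pack "C2", 0xc0 | ($o >> 6), 0x80 | ($o & 0x3f);
# Bytes produced by Perl's own UTF-8 encoder
my $bytes  = encode("UTF-8", chr($o));
printf "manual: %vX  encode: %vX\n", $manual, $bytes;
# prints "manual: C2.A4  encode: C2.A4"
```

Both routes yield the byte pair 0xC2 0xA4 derived above.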

    Note that if the original text was in ISO 8859-15, 0xA4 is the euro symbol € which would be translated to ¤ by the regex.

    Anyway, the bit twiddling in the sprintf does the UTF8 conversion (I'm using jonadab's representation here):

    sprintf("%c%c",
        # Build the first byte by OR'ing 0xc0 (binary 11000000) with
        # the two highest-order bits of the character
        (0xc0 | ($o >> 6)),
        # Build the second byte by OR'ing 0x80 (binary 10000000)
        # with the lower 6 bits of the character (obtained by
        # AND'ing with 0x3f, binary 00111111)
        (0x80 | ($o & 0x3f))
    )

    Please excuse my gratuitous invention of new English verbs.

    CU
    Robartes-

      It does this in a very naive way, BTW, that is only valid if the input text is in a code page whose characters 0x80 - 0xFF correspond to Unicode code points U+0080 to U+00FF, which is to say not many of them.

      And Perl is exactly this naive as well. You can get this exact same result without writing any bit-twiddling Perl code by instead convincing Perl to promote the string to UTF-8, and then storing the resulting bytes into a Perl byte string (or by just turning off the "is UTF-8" bit on that Perl scalar). For example:

      #!/usr/bin/perl -w
      use strict;
      require utf8;

      my $s= pack "C*", 1..255;   # Byte string to convert
      my $u= pack "U*", 1..255;   # UTF-8 string
      my $e= substr($u,0,0);      # Empty UTF-8 string

      my $r= $s;                  # Convert using regex
      $r =~ s{ ([^\0-\x7F]) }{
          my $o= ord($1);
          sprintf "%c%c", 0xc0 | ( $o >> 6 ), 0x80 | ( $o & 0x3f );
      }gex;

      my $i= $s.$e;               # Convert by implicit upgrade to UTF-8

      my $f= $s;                  # Upgrade via utf8.pm function
      utf8::upgrade( $f );

      my $b= $s;                  # Upgrade then mark as bytes
      utf8::encode( $b );

      if( $r eq $b ) {
          print "The regex and utf8::encode() match.\n";
      }
      if( $u eq $i  &&  $i eq $f ) {
          print "The 3 Unicode strings match.\n";
      }
      if( join(" ",unpack"C*",$r) eq join(" ",unpack"C*",$i) ) {
          print "The byte- and unicode-strings have the same bytes.\n";
      }
      if( $r ne $i ) {
          print "The byte- and unicode-strings are not equal.\n";
      }
      print '$s contains ', length($s), " bytes.\n";
      print '$i contains ', length($i), " characters.\n";
      print '$r contains ', length($r), " bytes.\n";

      Which produces:

      The regex and utf8::encode() match.
      The 3 Unicode strings match.
      The byte- and unicode-strings have the same bytes.
      The byte- and unicode-strings are not equal.
      $s contains 255 bytes.
      $i contains 255 characters.
      $r contains 383 bytes.

      The regex is different in that it doesn't molest null bytes. If you change "1..255" to "0..255" in the above code, you'll see that when Perl (v5.8.7 on Win32, anyway) converts a byte string to Unicode, it just unceremoniously stops at any bytes of value 0.

      - tye        

Re: Intra-Unicode Conversions
by kettle (Beadle) on Nov 15, 2006 at 17:46 UTC
    Thanks so much for all the great explanations! That really cleared things up. I also wrote a hack to solve the problem I alluded to at the end of my first post. It is extremely inefficient, but seems to properly carry out all of the conversions I'm interested in. Please rip it to shreds for me:
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    #Convert annoying, worthless 'fullwidth' Latin-1 characters
    #to their semi-sane normal ASCII counterparts
    my(%codes,$wide,$ascii,$x);

    #this is the land where the fullwidth latin-1 characters reside
    for($x=65281;$x<65374;$x++){
        ($wide,$ascii) = make_codes($x);
        $codes{$wide} = $ascii;
    }

    while(<>){
        chomp;
        foreach my $utf(keys %codes){
            s/$utf/$codes{$utf}/g;
        }
        print $_."\n";
    }

    sub make_codes{
        my $ud = $_[0];
        my $from = ud_to_utf8hex($ud);
        #subtract 65248 to get the ASCII value
        my $to = ud_to_utf8hex($ud-65248);
        return($from, $to);
    }

    sub ud_to_utf8hex{
        my $ud = $_[0];
        my ($b1,$b2,$b3);
        if($ud >= 0 && $ud <= 127){ #basic ASCII values don't need to be altered
            return(sprintf("%c",$ud));
        }elsif($ud >= 2048 && $ud <= 65535){ #valid for 2048 <= $ud <= 65535
            $b1 = 224 + sprintf("%d", ($ud/4096));
            $b2 = 128 + (($ud/64) % 64);
            $b3 = 128 + ($ud % 64);
        }
        return(sprintf("\\x\{%X\}\\x\{%X\}\\x\{%X\}",$b1,$b2,$b3));
    }
    I chose to use the decimal numbers for some practice with the unicode standard, and because they are easier on my brain than the hex codes.
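For comparison, a much shorter approach is possible (a sketch added here, not part of the original reply; it assumes the input has already been decoded to Perl characters rather than raw UTF-8 bytes). The fullwidth forms U+FF01..U+FF5E sit at a fixed offset of 0xFEE0 above their ASCII counterparts 0x21..0x7E, so one substitution covers the whole block:

```perl
#!/usr/bin/perl
use strict;
use warnings;
binmode STDOUT, ":encoding(UTF-8)";

# "Hello" spelled with fullwidth Latin characters
my $s = "\x{FF28}\x{FF45}\x{FF4C}\x{FF4C}\x{FF4F}";
# Map each fullwidth character down by the fixed 0xFEE0 offset
(my $t = $s) =~ s/([\x{FF01}-\x{FF5E}])/chr(ord($1) - 0xFEE0)/ge;
print "$t\n";   # prints "Hello"
```

This avoids building a lookup table and looping over its keys for every input line, though it only handles the one block, not the halfwidth Katakana forms elsewhere in U+FF00..U+FFEF.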
Re: Intra-Unicode Conversions
by caelifer (Scribe) on Nov 15, 2006 at 14:29 UTC
    To understand the UTF-8 encoding rules, I suggest looking at the UTF-8 article on Wikipedia.

    BR