kettle has asked for the wisdom of the Perl Monks concerning the following question:

I would be very grateful if someone could give me a detailed description of the following regular expression:

s/([^\0-\x7F])/do {my $o = ord($1); sprintf("%c%c", 0xc0 | ($o >> 6), 0x80 | ($o & 0x3f)) }/ge

The above is the meat of a uri_escape_utf8 subroutine. I get the gist of what is going on; we're converting a UTF-8-encoded string to its ASCII-only URI representation. Here is my understanding so far:

First, we capture any character that falls outside the range \0-\x7F, using the negated character class [^\0-\x7F] (that range being the basic ASCII character set, I think).

Next every time we grab a legal character, we store the numeric value of our character in $o, then we sprintf the conversion... But I don't know exactly what is going on in the sprintf statement...

I want to do something similar: convert all fullwidth Latin characters (0xFF01-0xFF5E) to their more common ASCII representations (0x0021-0x007E). Any lucid explanation of the above code, or alternatively of how to directly solve my problem, will be greatly appreciated!!

P.S. I know I can do this with a hash of corresponding values, I'm hoping there is something shorter and more interesting.

Replies are listed 'Best First'.
Re: Intra-Unicode Conversions
by jonadab (Parson) on Nov 15, 2006 at 14:11 UTC

    The | binds more tightly than the comma, so the sprintf can be written, for clarity, like this:

    sprintf("%c%c", (0xc0 | ($o >> 6)), (0x80 | ($o & 0x3f)) )

    Wow, talk about bit-fiddling. For the meanings of the specific operators, see perldoc perlop, but in short we're making the first byte by shifting over the bits in the original character and then flipping on certain bits, resulting in the high bits of the original character being the low bits of our first byte. Then we're making the second byte by turning off certain bits in the original value (chiefly, the high bits, which are already represented in the first byte) and turning others on. I think.

    I don't know enough about unicode to explain the reasons behind the particulars, in terms of which bits end up where. If I were trying to figure it out, I'd draw myself a little diagram...
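As a sketch of such a diagram (not part of the original reply, just a worked example for U+00A4), the two OR operations lay the bits out like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $o  = 0xA4;                 # sample character: U+00A4, binary 10100100
my $b1 = 0xc0 | ($o >> 6);     # 110xxxxx: the character's high bits, shifted down
my $b2 = 0x80 | ($o & 0x3f);   # 10xxxxxx: the character's low six bits
printf "%08b %08b -> 0x%02X 0x%02X\n", $b1, $b2, $b1, $b2;
# prints "11000010 10100100 -> 0xC2 0xA4"
```

So the ">> 6" moves the high bits into the low end of the first byte, and "& 0x3f" keeps only the low six bits for the second byte, exactly as described above.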


    Sanity? Oh, yeah, I've got all kinds of sanity. In fact, I've developed whole new kinds of sanity. You can just call me "Mister Sanity". Why, I've got so much sanity it's driving me crazy.
Re: Intra-Unicode Conversions
by robartes (Priest) on Nov 15, 2006 at 14:51 UTC

    What the rather convoluted regex in your post does is take a character (in the narrow definition thereof: an unsigned char, in C-speak) in the range 0x80 - 0xFF (your basic 'code page playground' of yore) and convert it to its valid UTF-8 representation. It does this in a very naive way, BTW, that is only valid if the input text is in a code page whose characters 0x80 - 0xFF correspond to Unicode code points U+0080 to U+00FF (i.e. ISO 8859-1), which is to say not many of them.

    UTF-8 says that any Unicode codepoint in the range U+0080 to U+07FF is encoded in two bytes, with the first three bits (highest order bits) of the first (highest order) byte being 110 and the first two bits of the second byte being 10. The remaining 11 bits are used to store the actual codepoint value. E.g., the character U+00A4 (the currency symbol ¤) is stored as follows:

    Codepoint U+00A4 --> hex 0xA4 --> binary 10100100
    We need to store 10100100 in the UTF-8 bytes:  110xxxxx 10xxxxxx
    We distribute 10100100 over the 'x' slots in the two bytes:  110 00010   10 100100
    So U+00A4 in UTF-8 becomes 11000010 10100100, or 0xC2 0xA4.
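That arithmetic can be checked against Perl's own encoder (a sanity check added here, not part of the original reply; Encode has been a core module since 5.8):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

my $o      = 0xA4;   # U+00A4, the currency sign
# Bytes built by hand, as in the regex under discussion
my $manual = pack "C2", 0xc0 | ($o >> 6), 0x80 | ($o & 0x3f);
# Bytes produced by Perl's own UTF-8 encoder
my $bytes  = encode("UTF-8", chr($o));
printf "manual: %vX  encode: %vX\n", $manual, $bytes;
# prints "manual: C2.A4  encode: C2.A4"
```

Both routes yield the byte pair 0xC2 0xA4 derived above.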

    Note that if the original text was in ISO 8859-15, 0xA4 is the euro symbol € which would be translated to ¤ by the regex.

    Anyway, the bit twiddling in the sprintf does the UTF8 conversion (I'm using jonadab's representation here):

    sprintf("%c%c",
        # Build the first byte by OR'ing 0xc0 (binary 11000000) with
        # the two highest-order bits of the character
        (0xc0 | ($o >> 6)),
        # Build the second byte by OR'ing 0x80 (binary 10000000)
        # with the lower 6 bits of the character (obtained by
        # AND'ing with 0x3f, binary 00111111)
        (0x80 | ($o & 0x3f))
    )

    Please excuse my gratuitous invention of new English verbs.

    CU
    Robartes-

      It does this in a very naive way, BTW, that is only valid if the input text is in a code page whose characters 0x80 - 0xFF correspond to Unicode code points U+0080 to U+00FF, which is to say not many of them.

      And Perl is exactly this naive as well. You can get this exact same result without writing any bit-twiddling Perl code by instead convincing Perl to promote the string to UTF-8, and then storing the resulting bytes into a Perl byte string (or by just turning off the "is UTF-8" bit on that Perl scalar). For example:

      #!/usr/bin/perl -w
      use strict;
      require utf8;

      my $s= pack "C*", 1..255;   # Byte string to convert
      my $u= pack "U*", 1..255;   # UTF-8 string
      my $e= substr($u,0,0);      # Empty UTF-8 string

      my $r= $s;                  # Convert using regex
      $r =~ s{ ([^\0-\x7F]) }{
          my $o= ord($1);
          sprintf "%c%c", 0xc0 | ( $o >> 6 ), 0x80 | ( $o & 0x3f );
      }gex;

      my $i= $s.$e;               # Convert by implicit upgrade to UTF-8

      my $f= $s;                  # Upgrade via utf8.pm function
      utf8::upgrade( $f );

      my $b= $s;                  # Upgrade then mark as bytes
      utf8::encode( $b );

      if( $r eq $b ) {
          print "The regex and utf8::encode() match.\n";
      }
      if( $u eq $i  &&  $i eq $f ) {
          print "The 3 Unicode strings match.\n";
      }
      if( join(" ",unpack"C*",$r) eq join(" ",unpack"C*",$i) ) {
          print "The byte- and unicode-strings have the same bytes.\n";
      }
      if( $r ne $i ) {
          print "The byte- and unicode-strings are not equal.\n";
      }
      print '$s contains ', length($s), " bytes.\n";
      print '$i contains ', length($i), " characters.\n";
      print '$r contains ', length($r), " bytes.\n";

      Which produces:

      The regex and utf8::encode() match.
      The 3 Unicode strings match.
      The byte- and unicode-strings have the same bytes.
      The byte- and unicode-strings are not equal.
      $s contains 255 bytes.
      $i contains 255 characters.
      $r contains 383 bytes.

      The regex is different in that it doesn't molest null bytes. If you change "1..255" to "0..255" in the above code, you'll see that when Perl (v5.8.7 on Win32, anyway) converts a byte string to Unicode, it just unceremoniously stops at any bytes of value 0.

      - tye        

Re: Intra-Unicode Conversions
by kettle (Beadle) on Nov 15, 2006 at 17:46 UTC
    Thanks so much for all the great explanations! That really cleared things up. I also wrote a hack to solve the problem I alluded to at the end of my first post. It is extremely inefficient, but seems to properly carry out all of the conversions I'm interested in. Please rip it to shreds for me:
    #!/usr/bin/perl -w
    use strict;
    use warnings;

    #Convert annoying, worthless 'fullwidth' Latin-1 characters
    #to their semi-sane normal ASCII counterparts
    my(%codes,$wide,$ascii,$x);

    #this is the land where the fullwidth latin-1 characters reside
    for($x=65281;$x<65374;$x++){
        ($wide,$ascii) = make_codes($x);
        $codes{$wide} = $ascii;
    }

    while(<>){
        chomp;
        foreach my $utf(keys %codes){
            s/$utf/$codes{$utf}/g;
        }
        print $_."\n";
    }

    sub make_codes{
        my $ud = $_[0];
        my $from = ud_to_utf8hex($ud);
        #subtract 65248 to get the ASCII value
        my $to = ud_to_utf8hex($ud-65248);
        return($from, $to);
    }

    sub ud_to_utf8hex{
        my $ud = $_[0];
        my ($b1,$b2,$b3);
        if($ud >= 0 && $ud <= 127){ #basic ASCII values don't need to be altered
            return(sprintf("%c",$ud));
        }elsif($ud >= 2048 && $ud <= 65535){ #valid for 2048 <= $ud <= 65535
            $b1 = 224 + sprintf("%d", ($ud/4096));
            $b2 = 128 + (($ud/64) % 64);
            $b3 = 128 + ($ud % 64);
        }
        return(sprintf("\\x\{%X\}\\x\{%X\}\\x\{%X\}",$b1,$b2,$b3));
    }
    I chose to use the decimal numbers for some practice with the unicode standard, and because they are easier on my brain than the hex codes.
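For comparison, a much shorter approach is possible (a sketch added here, not part of the original reply; it assumes the input has already been decoded to Perl characters rather than raw UTF-8 bytes). The fullwidth forms U+FF01..U+FF5E sit at a fixed offset of 0xFEE0 above their ASCII counterparts 0x21..0x7E, so one substitution covers the whole block:

```perl
#!/usr/bin/perl
use strict;
use warnings;
binmode STDOUT, ":encoding(UTF-8)";

# "Hello" spelled with fullwidth Latin characters
my $s = "\x{FF28}\x{FF45}\x{FF4C}\x{FF4C}\x{FF4F}";
# Map each fullwidth character down by the fixed 0xFEE0 offset
(my $t = $s) =~ s/([\x{FF01}-\x{FF5E}])/chr(ord($1) - 0xFEE0)/ge;
print "$t\n";   # prints "Hello"
```

This avoids building a lookup table and looping over its keys for every input line, though it only handles the one block, not the halfwidth Katakana forms elsewhere in U+FF00..U+FFEF.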
Re: Intra-Unicode Conversions
by caelifer (Scribe) on Nov 15, 2006 at 14:29 UTC
    To understand the UTF-8 encoding rules, I suggest looking at the UTF-8 article on Wikipedia.

    BR