emav has asked for the wisdom of the Perl Monks concerning the following question:

I've been looking for a script to do just that for quite some time now but, as I couldn't find anything that satisfied my needs, I thought I should port the java function that is available here. I would be happy to read any comments or suggestions for improvement.

#!/usr/bin/perl -- print "Content-Type: text/html\n\n"; print "<html><head>"; print "<META HTTP-EQUIV='CONTENT-TYPE'"; print " CONTENT='text/html; charset=utf-8'>"; print "</head><body>"; &pair_split; # Results can be manipulated here print "</body></html>"; exit; # Parse input -- Use POST method sub pair_split{ read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'}); @pairs = split(/&/, $buffer); foreach $pair (@pairs) { local($name, $value) = split(/=/, $pair); if ($name =~ /\%/) {$strng = $name; &xxutf; $name = $sbuf;} if ($value =~ /\%/) {$strng = $value; &xxutf; $value = $sbuf;} $name =~ tr/+/ /; $name =~ tr/\0//d; $value =~ tr/+/ /; $value =~ tr/\0//d; $$name = $value; # Assign values to $names } } # Decode XX-encoded letters to UTF-8 sub xxutf { $l = length($strng); $ch = -1; $b = 0; $sumb = 0; for ($i = 0, $more = -1; $i < $l; $i++) { # Get next byte b from URL segment strng $ch = substr $strng, $i, 1; if ($ch eq '%') { $i++; $ch = substr $strng, $i, 1; $hb = ($ch =~ /[0-9]/) ? $ch - '0' : 10+(ord(lc$ch) - ord('a')) & 15; $i++; $ch = substr $strng, $i, 1; $lb = ($ch =~ /[0-9]/) ? $ch - '0' : 10+(ord(lc$ch) - ord('a')) & 15; $b = ($hb << 4) | $lb; } elsif ($ch eq '+') { $b = ' '; } else {$b = $ch}; # Decode byte b as UTF-8, sumb collects incomplete chars if (($b & 192) == 128) { # 10xxxxxx (continuation byte) $sumb = ($sumb << 6) | ($b & 63); # Add 6 bits to sumb if (--$more == 0) { $sumb = "&#" . $sumb; # Create UTF-8 encoding $sumb = $sumb . ";"; $sbuf = $sbuf . $sumb; # Add char to sbuf } } elsif (($b & 128) == 0) { # 0xxxxxxx (yields 7 bits) $sbuf = $sbuf . $b; # Store in sbuf } elsif (($b & 224) == 192) { # 110xxxxx (yields 5 bits) $sumb = $b & 31; $more = 1; # Expect 1 more byte } elsif (($b & 240) == 224) { # 1110xxxx (yields 4 bits) $sumb = $b & 15; $more = 2; # Expect 2 more bytes } elsif (($b & 248) == 240) { # 11110xxx (yields 3 bits) $sumb = $b & 7; $more = 3; # Expect 3 more bytes } elsif (($b & 252) == 248) { # 111110xx (yields 2 bits) $sumb = $b & 3; $more = 4; # Expect 4 more bytes } else { # if ((b & 0xfe) == 0xfc) $sumb = $b & 1; # 1111110x (yields 1 bit) $more = 5; # Expect 5 more bytes } # We don't test if the UTF-8 encoding is well-formed } }

Replies are listed 'Best First'.
Re: Perl script to transform XX-encoding to UTF-8
by kyoshu (Curate) on Jun 23, 2005 at 18:47 UTC
    there are alot perl modules that can do it like Text::Iconv or even open can do it if you need to convert some file.
Re: Perl script to transform XX-encoding to UTF-8
by Adrade (Pilgrim) on Jun 23, 2005 at 23:27 UTC
    I'm perhaps not totally sure what you're going for, but I believe that what you're doing can also be accomplished with the following. However, well done on the port!
    $string =~ s/%(..)/pack("c",hex($1))/ge;
      -Adam

    Update: The CGI module seems to do this transformation correctly. Since it appears you're writing a CGI script, perhaps that module would be useful.

    --
    Impossible! The Remonster can only be killed by stabbing him in the heart with the ancient bone saber of Zumakalis!

      I'm aware of this regex but it doesn't work with extended unicode character sets. At least, it didn't work with the accented ancient Greek characters (1F00-1FFF) that I am interested in.