I've been looking for a script to do just that for quite some time now but, as I couldn't find anything that satisfied my needs, I thought I should port the java function that is available here. I would be happy to read any comments or suggestions for improvement.

#!/usr/bin/perl -- print "Content-Type: text/html\n\n"; print "<html><head>"; print "<META HTTP-EQUIV='CONTENT-TYPE'"; print " CONTENT='text/html; charset=utf-8'>"; print "</head><body>"; &pair_split; # Results can be manipulated here print "</body></html>"; exit; # Parse input -- Use POST method sub pair_split{ read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'}); @pairs = split(/&/, $buffer); foreach $pair (@pairs) { local($name, $value) = split(/=/, $pair); if ($name =~ /\%/) {$strng = $name; &xxutf; $name = $sbuf;} if ($value =~ /\%/) {$strng = $value; &xxutf; $value = $sbuf;} $name =~ tr/+/ /; $name =~ tr/\0//d; $value =~ tr/+/ /; $value =~ tr/\0//d; $$name = $value; # Assign values to $names } } # Decode XX-encoded letters to UTF-8 sub xxutf { $l = length($strng); $ch = -1; $b = 0; $sumb = 0; for ($i = 0, $more = -1; $i < $l; $i++) { # Get next byte b from URL segment strng $ch = substr $strng, $i, 1; if ($ch eq '%') { $i++; $ch = substr $strng, $i, 1; $hb = ($ch =~ /[0-9]/) ? $ch - '0' : 10+(ord(lc$ch) - ord('a')) & 15; $i++; $ch = substr $strng, $i, 1; $lb = ($ch =~ /[0-9]/) ? $ch - '0' : 10+(ord(lc$ch) - ord('a')) & 15; $b = ($hb << 4) | $lb; } elsif ($ch eq '+') { $b = ' '; } else {$b = $ch}; # Decode byte b as UTF-8, sumb collects incomplete chars if (($b & 192) == 128) { # 10xxxxxx (continuation byte) $sumb = ($sumb << 6) | ($b & 63); # Add 6 bits to sumb if (--$more == 0) { $sumb = "&#" . $sumb; # Create UTF-8 encoding $sumb = $sumb . ";"; $sbuf = $sbuf . $sumb; # Add char to sbuf } } elsif (($b & 128) == 0) { # 0xxxxxxx (yields 7 bits) $sbuf = $sbuf . $b; # Store in sbuf } elsif (($b & 224) == 192) { # 110xxxxx (yields 5 bits) $sumb = $b & 31; $more = 1; # Expect 1 more byte } elsif (($b & 240) == 224) { # 1110xxxx (yields 4 bits) $sumb = $b & 15; $more = 2; # Expect 2 more bytes } elsif (($b & 248) == 240) { # 11110xxx (yields 3 bits) $sumb = $b & 7; $more = 3; # Expect 3 more bytes } elsif (($b & 252) == 248) { # 111110xx (yields 2 bits) $sumb = $b & 3; $more = 4; # Expect 4 more bytes } else { # if ((b & 0xfe) == 0xfc) $sumb = $b & 1; # 1111110x (yields 1 bit) $more = 5; # Expect 5 more bytes } # We don't test if the UTF-8 encoding is well-formed } }

In reply to Perl script to transform XX-encoding to UTF-8 by emav

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.