I had a problem with an application that produced horrible XML output in a mix of UTF-8 and ISO-8859-1 encodings. I found this way to convert it to pure UTF-8 without double-encoding the UTF-8 sequences that are already there. I know it will not work in all cases (a run of ISO-8859-1 bytes that happens to form a valid UTF-8 sequence is left untouched), but it has been helpful. What do you think?
#!/usr/bin/perl
use strict;

# mixed string with ISO 8859-1 and UTF-8:
my $test_string = "Das Å (auch \"bolle-Å\" genannt, was soviel bedeutet wie \"Kringel-Å\") ist mit der ".
    force_utf8("dänischen Rechtschreibreform von 1948 eingeführt worden.");

print "Source: $test_string\n";
print "UTF : ".force_utf8($test_string)."\n";
print "ISO : ".force_latin($test_string)."\n";

# Convert every high byte that is not part of a UTF-8 sequence to UTF-8.
sub force_utf8 {
    my $string = shift;
    $string =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\x80-\xff])/encode_char_utf8($1)/ge;
    return $string;
}

# Convert every UTF-8 sequence to ISO 8859-1 where possible.
sub force_latin {
    my $string = shift;
    $string =~ s/([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3}|[\x80-\xff])/decode_char_utf8($1)/ge;
    return $string;
}

sub encode_char_utf8 {
    my $char = shift;
    # Already a UTF-8 sequence: leave it alone.
    if($char =~ /^([\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})$/) {
        return $char;
    }
    # Single ISO 8859-1 byte: encode it as a two-byte UTF-8 sequence.
    my $value = ord($char);
    return chr(($value>>6) | 0xc0).chr(0x80 | ($value & 0x3f));
}

sub decode_char_utf8 {
    my $char = shift;
    # Three- and four-byte sequences have no ISO 8859-1 equivalent: drop them.
    if($char =~ /^([\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})$/) {
        return '';
    }
    elsif($char =~ /^([\xc0-\xdf])([\x80-\xbf])$/) {
        # Two-byte sequence: keep it only if the code point fits in one byte.
        my $value = ((ord($1) & 0x1f)<<6)+(ord($2) & 0x3f);
        if($value<256) {
            return chr($value);
        }
        else {
            return '';
        }
    }
    else {
        # Single high byte: already ISO 8859-1.
        return $char;
    }
}
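For comparison, here is a minimal sketch of the same idea as the substitution in force_utf8 above, written with the core Encode module. It is only an illustration under a few assumptions: the names force_utf8_strict and latin1_byte_to_utf8 are made up for this example, the input is treated as a byte string, and the lead-byte ranges are narrowed to [\xc2-\xdf] and [\xf0-\xf4] so that the overlong lead bytes \xc0 and \xc1 and the unused lead bytes \xf5-\xf7 are handled as ISO 8859-1 instead of being passed through. Like the original, it does not check for overlong three- or four-byte forms or for surrogates.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: one ISO 8859-1 byte in, its UTF-8 byte sequence out.
sub latin1_byte_to_utf8 {
    my ($byte) = @_;
    return encode('UTF-8', $byte);
}

# Hypothetical stricter variant of force_utf8: byte runs that look like
# UTF-8 are kept, every other high byte is treated as ISO 8859-1.
sub force_utf8_strict {
    my ($bytes) = @_;
    $bytes =~ s{
        (  [\xc2-\xdf][\x80-\xbf]
         | [\xe0-\xef][\x80-\xbf]{2}
         | [\xf0-\xf4][\x80-\xbf]{3} )
      | ([\x80-\xff])
    }{
        defined $1 ? $1 : latin1_byte_to_utf8($2)
    }gex;
    return $bytes;
}

# Small hex dump so the result can be inspected on any terminal.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# An ISO 8859-1 "Å" (c5), a space, then a UTF-8 "Å" (c3 85).
my $mixed = "\xc5 \xc3\x85";
my $fixed = force_utf8_strict($mixed);
print hex_dump($fixed), "\n";    # c3 85 20 c3 85
```

Running the result through force_utf8_strict a second time leaves it unchanged, which is the "no double encoding" property. The known limitation stays: ISO 8859-1 text such as "\xc3\xa9" (meant as "Ã©") is indistinguishable from UTF-8 "é" and is passed through as it is.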

In reply to Mixed ISO-8859/UTF-8 conversion by olli
