Transliterate cp1252 0x80-0x9f to utf8 equivalents

Not particularly clever but may help anyone who has had as much trouble with cp1252 as I've had. And it may save some typing.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::Entities;

# random selection of cp1252 goodies
my $str = join('',
  chr(0x80), chr(0x81), chr(0x91), chr(0x92),
  chr(0x93), chr(0x94), chr(0x95), chr(0x96),
);
my $original = $str;

# delete any chars not assigned
$str =~ tr/\x81\x8D\x8F\x90\x9D//d;

# replace the rest
$str =~ tr{\x80\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8E\x91\x9
+2\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9E\x9F}
          {\x{20AC}\x{201A}\x{0192}\x{201E}\x{2026}\x{2020}\x{2021}\x{
+02C6}\x{2030}\x{0160}\x{2039}\x{0152}\x{017D}\x{2018}\x{2019}\x{201C}
+\x{201D}\x{2022}\x{2013}\x{2014}\x{02DC}\x{2122}\x{0161}\x{203A}\x{01
+53}\x{017E}\x{0178}/};

# check what happened without trying to print wide chars
my $encoded = encode_entities($str);
$str =~ s/(.)/sprintf( "\\x{%x}", ord($1))/eg;

print qq{original: $original\n};
print qq{hex: $str\n};
print qq{encoded: $encoded\n};

print qq{done\n};


__DATA__
80    0x20AC
81
82    0x201A
83    0x0192
84    0x201E
85    0x2026
86    0x2020
87    0x2021
88    0x02C6
89    0x2030
8A    0x0160
8B    0x2039
8C    0x0152
8D
8E    0x017D
8F
90
91    0x2018
92    0x2019
93    0x201C
94    0x201D
95    0x2022
96    0x2013
97    0x2014
98    0x02DC
99    0x2122
9A    0x0161
9B    0x203A
9C    0x0153
9D
9E    0x017E
9F    0x0178
[download]

Comment on Transliterate cp1252 0x80-0x9f to utf8 equivalents Download Code

Replies are listed 'Best First'.
Re: Transliterate cp1252 0x80-0x9f to utf8 equivalents by graff (Chancellor) on Sep 22, 2007 at 15:43 UTC
Um, why wouldn't you just do this: `use Encode; my $cp1252_str = join('', chr(0x80), chr(0x81), chr(0x91), chr(0x92), chr(0x93), chr(0x94), chr(0x95), chr(0x96), ); my $utf8_str = decode( 'cp1252', $cp1252_str ); # update: you can now remove "unmapped" byte values this way: $utf8_str =~ tr/\x{fffd}//d;` [download] And if the cp1252 data is coming in from a file (or a file handle), you could just do this: `open( INPUT, "<:encoding(cp1252)", $filename ) or die $!; # or, if the file handle is already open (e.g. STDIN): # binmode FILEHANDLE, ":encoding(cp1252)"; while (<INPUT>) { # $_ contains utf8 data... }` [download]	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re: Transliterate cp1252 0x80-0x9f to utf8 equivalents
by graff (Chancellor) on Sep 22, 2007 at 15:43 UTC

use Encode;

my $cp1252_str = join('',
  chr(0x80), chr(0x81), chr(0x91), chr(0x92),
  chr(0x93), chr(0x94), chr(0x95), chr(0x96),
);

my $utf8_str = decode( 'cp1252', $cp1252_str );

# update: you can now remove "unmapped" byte values this way:

$utf8_str =~ tr/\x{fffd}//d;
[download]

open( INPUT, "<:encoding(cp1252)", $filename ) or die $!;

# or, if the file handle is already open (e.g. STDIN):
#  binmode FILEHANDLE, ":encoding(cp1252)";

while (<INPUT>) {
    # $_ contains utf8 data...
}
[download]

[reply]
[d/l]
[select]