I suspect you are dealing with UTF-8 encoded data, but you could also be dealing with code points, and we need to figure out which is the case.
When you refer to ®, can you tell whether it is stored as two Perl characters or just one? That is, if you were to isolate ® into a string variable (say $x), what would print length($x) display?
Here's an example that illustrates the difference:
my $x = chr(174);
binmode STDOUT, ':utf8';
print "x has length ", length($x), " >>$x<<\n";
and this emits:
x has length 1 >>®<<
So, even though $x has length 1, it is written out as two bytes (0xC2 0xAE), because the :utf8 layer encodes the character on output. On the other hand, $x could also have length 2:
my $x = chr(194).chr(174);
binmode STDOUT, ':bytes';
print "x has length ", length($x), " >>$x<<\n";
and this emits:
x has length 2 >>®<<
The upshot is that if $x has length 1, your string probably contains Unicode code-points, and you'll likely want to look into using the encode_entities function from the module HTML::Entities. This is a general way to convert code-points to HTML entity references.
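As a minimal sketch of the code-point case, assuming HTML::Entities is installed from CPAN (the variable names here are only illustrative):

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# A character string holding the single code point U+00AE (REGISTERED SIGN).
my $x = chr(174);

# With no second argument, encode_entities replaces control characters,
# characters above 0x7F, and the HTML-special characters < > & " '.
my $html = encode_entities($x);

print "length ", length($x), " becomes $html\n";   # length 1 becomes &reg;
```

If $x held the two UTF-8 bytes instead, each byte would be treated as its own character and you would get two entities rather than &reg;.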
On the other hand, if $x has length 2, then your string probably contains UTF-8 encoded bytes. You would then likely find it advantageous to convert those bytes into Unicode code points using the decode function from the Encode module like this:
use Encode;
my $code_points = decode('UTF-8', $x);
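Here is a small sketch of that conversion, assuming the two-byte string from the earlier example. Note that in the Encode module, decode is the direction that goes from bytes to characters, and encode goes from characters back to bytes. The variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two bytes: the UTF-8 encoding of U+00AE.
my $bytes = chr(194) . chr(174);

# decode interprets the bytes as UTF-8 and returns a character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length($bytes), length($chars), ord($chars);
# prints: bytes: 2, characters: 1, code point: U+00AE

# encode goes the other way and reproduces the original two bytes.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq $bytes;
```

After decoding, length, substr and regular expressions all see ® as a single character.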
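The reason you would want to use code points in your program rather than UTF-8 bytes is that Perl handles strings much better when they are stored as code points: functions such as length and substr, as well as regular expressions, then operate on whole characters rather than on individual bytes.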