in reply to matching special char

I have a feeling that you are dealing with UTF-8 encoded bytes, but you could also be dealing with already-decoded characters (code points), and we need to figure out which is the case.

When you refer to ®, can you tell if it is stored as two perl characters or just one? That is, if you were to isolate ® into a string variable (say $x), what would print length($x) display?

Here's an example that illustrates the difference:

my $x = chr(174);
binmode STDOUT, ':utf8';
print "x has length ", length($x), " >>$x<<\n";
and this emits:
x has length 1 >>®<<
So, even though $x has length 1, it is written out as two bytes (0xC2 0xAE), because the :utf8 layer encodes the character on output. On the other hand, $x itself could have length 2:
my $x = chr(194).chr(174);
binmode STDOUT, ':bytes';
print "x has length ", length($x), " >>$x<<\n";
and this emits:
x has length 2 >>®<<

The upshot is that if $x has length 1, your string probably contains Unicode code-points, and you'll likely want to look into using the encode_entities function from the module HTML::Entities. This is a general way to convert code-points to HTML entity references.
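
As a minimal sketch of the length-1 case, the snippet below passes a string containing chr(174) through encode_entities. It assumes HTML::Entities is installed (it is a CPAN module, not part of core Perl), and the sample text "Acme" is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical sample text: four ASCII letters followed by U+00AE.
my $text = "Acme" . chr(174);

# With no second argument, encode_entities replaces control characters,
# high-bit characters, and <, &, >, ', " with entity references.
my $html = encode_entities($text);

print "$html\n";    # prints: Acme&reg;
```

Because the result is plain ASCII, it prints the same way whatever output layer is set on STDOUT.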

On the other hand, if $x has length 2, then your string probably contains UTF-8 encoded bytes. You would then likely find it advantageous to convert those bytes into Unicode code-points using the decode function from the Encode module, like this:

use Encode;
my $code_points = decode('UTF-8', $x);
The reason to work with code-points in your program rather than UTF-8 bytes is that Perl's string operations (length, substr, regex matching) work on characters, so they only give sensible answers once each character is stored as a single code-point.
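
To make the byte-versus-character difference concrete, here is a small sketch that uses decode from the core Encode module on the same two bytes as the length-2 example. The variable names $bytes and $chars are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = chr(194) . chr(174);        # the two UTF-8 bytes 0xC2 0xAE
my $chars = decode('UTF-8', $bytes);    # one character, U+00AE

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    length($bytes), length($chars), ord($chars);
# prints: bytes: 2, characters: 1, code point: U+00AE
```

After decoding, a pattern such as /\x{AE}/ matches the character directly, which is not true of the raw two-byte form.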

Re^2: matching special char
by gman (Friar) on May 06, 2008 at 01:08 UTC

    Thanks for your reply,

    I tested the string:

    my $x = chr(174); binmode STDOUT, ':utf8'; print "x has length ", length($x), " >>$x<<\n";

    It showed up as one char,

    my $x = chr(174); $contents =~ s/$x/&reg;/g;

    This results in the proper substitution. I did search an extended ASCII table for the symbol, but somehow missed it. I will be looking up more information on the two solutions you showed.

    Thanks again,