matching special char

gman has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: matching special char by pc88mxer (Vicar) on May 05, 2008 at 20:29 UTC
I have a feeling that you are dealing with UTF-8 encoded data, but you could also be dealing with code points, and we need to figure out which is the case. When you refer to ®, can you tell whether it is stored as two Perl characters or just one? That is, if you were to isolate ® into a string variable (say `$x`), what would `print length($x)` display?

Here's an example that illustrates the difference: `my $x = chr(174); binmode STDOUT, ':utf8'; print "x has length ", length($x), " >>$x<<\n";` This emits: `x has length 1 >>®<<`

So even though `$x` has length 1, it goes out as two bytes, because the `:utf8` layer encodes the character on output. On the other hand, the string could also have length 2: `my $x = chr(194).chr(174); binmode STDOUT, ':bytes'; print "x has length ", length($x), " >>$x<<\n";` This emits: `x has length 2 >>®<<`

The upshot is that if `$x` has length 1, your string probably contains Unicode code points, and you'll likely want to look into using the `encode_entities` function from the module `HTML::Entities`. This is a general way to convert code points to HTML entity references.

On the other hand, if `$x` has length 2, then your string probably contains UTF-8 encoded bytes. You would then likely find it advantageous to convert that UTF-8 byte stream into Unicode code points using the `decode` function from the `Encode` module, like this: `use Encode; my $code_points = decode('utf8', $x);`

The reason you would want to use code points in your program rather than UTF-8 bytes is that Perl is much more adept at handling strings when they are stored as code points.
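To tie the two cases together, here is a minimal sketch that starts from the two-byte form, decodes it to a single code point, and then converts it to an entity. The variable names and the `foo...bar` sample text are only illustrative, and it assumes `HTML::Entities` (a CPAN module, not part of core Perl) is installed.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Two bytes (0xC2 0xAE) that together are the UTF-8 encoding of U+00AE.
my $bytes = chr(194) . chr(174);

# decode() turns the byte string into a character string of code points.
my $chars = decode('UTF-8', $bytes);

# encode_entities() replaces the code point with its named HTML entity.
my $html = encode_entities("foo${chars}bar");

printf "bytes: %d, chars: %d, html: %s\n",
    length($bytes), length($chars), $html;
```

This should print `bytes: 2, chars: 1, html: foo&reg;bar`. If your data is already one character long per symbol, skip the `decode` step and call `encode_entities` directly.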
Re^2: matching special char by gman (Friar) on May 06, 2008 at 01:08 UTC
Thanks for your reply. I tested the string: `my $x = chr(174); binmode STDOUT, ':utf8'; print "x has length ", length($x), " >>$x<<\n";` It showed up as one char. `my $string = chr(174); $contents =~ s/$string/\&reg;/g;` This results in the proper substitution. I did search for an extended ASCII table for the symbol, but somehow missed it. I will be looking up more information on the two solutions you showed. Thanks again,
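For readers following along, the single-character substitution can be sketched as below. The sample text and the `$escaped` variable are made up for illustration; the only assumption is that the string already holds ® as one code point (U+00AE), as the length test showed.

```perl
use strict;
use warnings;

# Hypothetical sample text; \x{AE} is the registered sign as one code point.
my $contents = "Acme\x{AE} widgets and Acme\x{AE} gadgets";

# Matching by hex escape avoids depending on the source file's own encoding.
# Copy first, then substitute, so the original string is left untouched.
(my $escaped = $contents) =~ s/\x{AE}/&reg;/g;

print "$escaped\n";
```

This should print `Acme&reg; widgets and Acme&reg; gadgets`. Writing the character as `\x{AE}` rather than pasting a literal ® into the script sidesteps the question of how the editor saved the file.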
Re: matching special char by mwah (Hermit) on May 05, 2008 at 19:19 UTC
Aside from the error pointed out by others already - is the data from some HTML source? Maybe it's ® or ®? `$contents =~ s/\®/\®/g; $contents =~ s/\®/\®/g;` Can you tell us more about the source? Regards mwa
Re: matching special char by toolic (Bishop) on May 05, 2008 at 19:09 UTC
`use warnings; use strict; my $contents = 'foo®bar'; $contents =~ s/®/\&req;/g; print "contents=:$contents:\n";` prints: `contents=:foo&req;bar:`
Re: matching special char by apl (Monsignor) on May 05, 2008 at 19:09 UTC
Off the top of my head, you should replace the double slash with a single slash. I assume you aren't using `use strict; use warnings;`. If you were, you'd probably get errors on the regexp.