Hi all,

I am a Perl newbie but a long-time gawk scripter who now needs to deal with text containing Unicode characters not in the base plane.

Specifically, I need to be able to globally substitute the Unicode left arrow character (U+2190) for a code point in SUPPLEMENTARY PRIVATE USE AREA-B, Plane 16, U+100049, while processing some input text.

Initially I used the a2p utility to convert a working gawk script, which uses the hex byte equivalents of Unicode characters to accomplish the substitution with the gawk gsub function. The Perl generated by a2p for the gsub function is a little hard to comprehend, so I tried to test how substitution should work with a much simpler Perl script:

use strict;
use warnings;
use utf8;
use feature 'unicode_strings';

my $txt;
my $tx1;
my $s_;
my $TestCh1;
my $TestCh2;

binmode STDOUT, ':encoding(UTF-8)';
printf "\x{FEFF}";

# $txt = "This =>\N{U+100049}<= is a Unicode character in Plane 16";
$txt = "This =>&#1048649;<= is a Unicode character in Plane 16";

$tx1 = $txt;
$tx1 =~ s/"\\N{U+100049}"/"\N{U+2190}"/ge;

print "0:\$txt=" . $tx1;
print "\n\n";

$tx1 = $txt;
$tx1 =~ s/\\xF4\\x80\\x81\\x89/"\\N{U+2190}"/ge;

print "1:\$txt=" . $tx1;
print "\n\n";

$tx1 = $txt;
$TestCh1 = "\\xF4\\x80\\x81\\x89";
$TestCh2 = "\\N{U+2190}";

($s_ = '"'.($TestCh2).'"') =~ s/&/\$&/g;
print "2:\$s_=" . $s_ . "!, \$TestCh1=" . $TestCh1 . "!, \$TestCh2=" .
    $TestCh2 . "!\n";

$tx1 =~ s/$TestCh1/eval $s_/ge;
print "2:\$tx1=" . $tx1 . "!\n";
print "\n";

$tx1 = $txt;
$TestCh2 = "\\xE2\\x86\\x90";
($s_ = '"'.($TestCh2).'"') =~ s/&/\$&/g;
print "3:\$s_=" . $s_ . "!, \$TestCh1=" . $TestCh1 . "!, \$TestCh2=" .
    $TestCh2 . "!\n";

$tx1 =~ s/$TestCh1/eval $s_/ge;
print "3:\$tx1=" . $tx1 . "!\n";
print "\n";

However, none of these techniques seems to be working for me. The third and fourth techniques in the above program use copies of the code generated by a2p for the gawk gsub function, so I thought they would work even if I did not understand the details of why, but they do not. The U+100049 character is never changed to U+2190.
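For comparison, here is a minimal sketch of the result being sought. It assumes the text is already a decoded character string rather than raw UTF-8 bytes, so the pattern can name the code point directly with `\x{100049}` and the replacement is a plain string with no `/e` or `eval` involved. The variable names `$text` and `$fixed` are hypothetical and are not taken from the a2p output.

```perl
use strict;
use warnings;

# Assumption: output should be UTF-8, as in the test script above.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical sample text containing U+100049 (Plane 16, private use).
my $text = "This =>\x{100049}<= is a Unicode character in Plane 16";

# Match the single character U+100049 and replace every occurrence with U+2190.
# The /r flag (Perl 5.14 and later) returns the modified copy and leaves $text unchanged.
my $fixed = $text =~ s/\x{100049}/\x{2190}/gr;

print "$fixed\n";
```

If the input were instead read as undecoded bytes, the pattern would have to match the four-octet sequence rather than one character, so which form applies depends on how the input handle is opened.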

Would you please enlighten me about how I should code this function in Perl? Also, a step-by-step explanation of the Perl code generated by a2p for the gawk gsub function would be much appreciated as a teaching tool to help me learn Perl.

My environment is Win7-64, Strawberry Perl 5.18.2, if that makes a difference.

TIA for any help you can provide to cure my ignorance.

Peter

In reply to Global substitution of non-base-plane Unicode characters by pjfarley3
