Global substitution of non-base-plane Unicode characters

pjfarley3 has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I am a Perl newbie but a long-time gawk scripter who now needs to deal with text containing Unicode characters not in the base plane.

Specifically, I need to be able to globally substitute the Unicode left arrow character (U+2190) for a code point in SUPPLEMENTARY PRIVATE USE AREA-B, Plane 16, U+100049, while processing some input text.

Initially I used the a2p utility to convert a working gawk script, which uses the hex byte equivalents of Unicode characters to accomplish the substitution with the gawk gsub function. The perl generated by a2p for the gsub function is a little hard to comprehend, so I tried to test how substitution should work with a much simpler perl script:

use strict;
use warnings;
use utf8;
use feature 'unicode_strings';

my $txt;
my $tx1;
my $s_;
my $TestCh1;
my $TestCh2;

binmode STDOUT, ':encoding(UTF-8)';
printf "\x{FEFF}";

# $txt = "This =>\N{U+100049}<= is a Unicode character in Plane 16";
$txt = "This =>􀁉<= is a Unicode character in Plane 16";

$tx1 = $txt;
$tx1 =~ s/"\\N{U+100049}"/"\N{U+2190}"/ge;

print "0:\$txt=" . $tx1;
print "\n\n";

$tx1 = $txt;
$tx1 =~ s/\\xF4\\x80\\x81\\x89/"\\N{U+2190}"/ge;

print "1:\$txt=" . $tx1;
print "\n\n";

$tx1 = $txt;
$TestCh1 = "\\xF4\\x80\\x81\\x89";
$TestCh2 = "\\N{U+2190}";

($s_ = '"'.($TestCh2).'"') =~ s/&/\$&/g;
print "2:\$s_=" . $s_ . "!, \$TestCh1=" . $TestCh1 . "!, \$TestCh2=" . $TestCh2 . "!\n";

$tx1 =~ s/$TestCh1/eval $s_/ge;
print "2:\$tx1=" . $tx1 . "!\n";
print "\n";

$tx1 = $txt;
$TestCh2 = "\\xE2\\x86\\x90";
($s_ = '"'.($TestCh2).'"') =~ s/&/\$&/g;
print "3:\$s_=" . $s_ . "!, \$TestCh1=" . $TestCh1 . "!, \$TestCh2=" . $TestCh2 . "!\n";

$tx1 =~ s/$TestCh1/eval $s_/ge;
print "3:\$tx1=" . $tx1 . "!\n";
print "\n";

However, none of these techniques seems to be working for me. The third and fourth techniques in the above program use copies of the code generated by a2p for the gawk gsub function, so I thought they would work even if I did not understand the details of why, but they do not. The U+100049 character is never changed to U+2190.

Would you please enlighten me about how I should code this function in Perl? Also, a step-by-step explanation of the Perl code generated by a2p for the gawk gsub function would be much appreciated as a teaching tool to help me learn perl.

My environment is Win7-64, Strawberry Perl 5.18.2, if that makes a difference.

TIA for any help you can provide to cure my ignorance.

Peter


Replies are listed 'Best First'.
Re: Global substitution of non-base-plane Unicode characters by Jim (Curate) on Feb 24, 2014 at 00:23 UTC
GNU Awk (gawk) isn't really Unicode-capable. Perl is. Using a2p in this case is just causing you needless confusion, especially if your objective is to learn how to handle Unicode using Perl.

Your Unicode text substitution is trivially accomplished in Perl 5.18, which is the version you said you're using.

use strict;
use warnings;
use v5.16;

binmode STDOUT, ':encoding(UTF-8)';
printf "\N{U+FEFF}"; # Unicode byte order mark

my $text = "Unicode code point U+100049: \N{U+100049}\n";

print $text;

$text =~ s/100049/002190/;
$text =~ s/\N{U+100049}/\N{U+002190}/;

print $text;

exit 0;

__END__

Unicode code point U+100049: 􀁉
Unicode code point U+002190: ←
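To see why byte-oriented patterns carried over from gawk do not match in Perl, it can help to look at the same character both ways. This is a minimal sketch, assuming the text has already been decoded into Perl characters (for example by `use utf8` on a literal, or by an `:encoding(UTF-8)` input layer). The variable names are illustrative and do not come from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A decoded (character) string holding one code point from Plane 16.
my $chars = "\x{100049}";

# The same text as UTF-8 octets, which is what a byte-oriented tool sees.
my $bytes = encode('UTF-8', $chars);

printf "characters: %d\n", length $chars;    # 1
printf "octets:     %d\n", length $bytes;    # 4

# The four-byte pattern matches the encoded octets only.
my $byte_pattern = qr/\xF4\x80\x81\x89/;
print "pattern vs octets:     ", ($bytes =~ $byte_pattern ? "match" : "no match"), "\n";
print "pattern vs characters: ", ($chars =~ $byte_pattern ? "match" : "no match"), "\n";

# Replace at the character level; encoding happens later, on output.
(my $fixed = $chars) =~ s/\x{100049}/\x{2190}/g;
my $hex = join ' ', map { sprintf('%02X', ord) } split //, encode('UTF-8', $fixed);
printf "result: U+%04X (UTF-8 octets: %s)\n", ord $fixed, $hex;
```

The design point is to pick one level and stay on it: either work on decoded characters and let the output layer encode them, or work on raw octets throughout. Mixing a byte pattern with a character string is what makes the substitution silently do nothing.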
Re^2: Global substitution of non-base-plane Unicode characters by pjfarley3 (Initiate) on Feb 24, 2014 at 04:31 UTC
I have tried exactly that and it works just fine. Thank you very much!

I also tested this version, and it works too:

$txline = $_;
$ch1 = "\N{U+100049}";
$ch2 = "\N{U+2190}";
$txline =~ s/$ch1/$ch2/g;

Again, many thanks; this helps my understanding a lot.

Peter
Re: Global substitution of non-base-plane Unicode characters by Anonymous Monk on Feb 23, 2014 at 22:15 UTC
"However, none of these techniques seems to be working for me."

How do you know? Consider ddumper (Basic debugging checklist, item 4):

#!/usr/bin/perl --
use strict;
use warnings;
use Data::Dump qw/ dd /;
use feature qw/ unicode_strings /;

my $uni = qq{\N{U+100049} \N{U+2190}};
dd( $uni );
$uni =~ s{\N{U+100049}}{\N{U+2190}}gx;
dd( $uni );

__END__
"\x{100049} \x{2190}"
"\x{2190} \x{2190}"

"Also, a step-by-step explanation of the Perl code generated by a2p for the gawk gsub function would be much appreciated as a teaching tool to help me learn perl."

Start with perlintro, perlrebackslash, perlrequick, and "Re: calling awk one liner from perl". Learning Perl from a2p is going to be tough.
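If Data::Dump is not installed, the same kind of check can be done with core Perl only. The following is a small sketch under that assumption; `code_points` is a hypothetical helper written for this illustration, not something from the thread or from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: render each character of a string as U+XXXX.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord) } split //, $string;
}

my $text = "=>\N{U+100049}<=";
print code_points($text), "\n";       # U+003D U+003E U+100049 U+003C U+003D

(my $changed = $text) =~ s/\N{U+100049}/\N{U+2190}/g;
print code_points($changed), "\n";    # U+003D U+003E U+2190 U+003C U+003D
```

Because the output is plain ASCII, it reads the same on any console, which avoids guessing whether a glyph was rendered correctly.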
Re^2: Global substitution of non-base-plane Unicode characters by pjfarley3 (Initiate) on Feb 23, 2014 at 23:40 UTC
Thanks for the reference to Data::Dump, very useful.

Yes, I am sure that I am not getting the translation in my code, since I did follow item 2 of the "Basic debugging checklist", using print and printf to dump the results of the substitutions and the intermediate variable $s_ after each attempt. Using your suggested Data::Dump code I also confirmed the print* results.

Here is my code updated with your Data::Dump suggestion and a few additional printf's, followed by the output results:

use strict;
use warnings;
use utf8;
use feature 'unicode_strings';
use Data::Dump qw/ dd /;

my $txt;
my $tx1;
my $s_;
my $TestCh1;
my $TestCh2;

binmode STDOUT, ':encoding(UTF-8)';
printf "\x{FEFF}";

# $txt = "This =>\N{U+100049}<= is a Unicode character in Plane 16";
$txt = "This =>􀁉<= is a Unicode character in Plane 16";
printf "Dumping \$txt=";
dd( $txt );

$tx1 = $txt;
$tx1 =~ s/"\\N{U+100049}"/"\N{U+2190}"/ge;

print "0:\$txt=" . $tx1;
print "\n";
printf "Dumping \$tx1=";
dd( $tx1 );
print "\n";

$tx1 = $txt;
$tx1 =~ s/\\xF4\\x80\\x81\\x89/"\\N{U+2190}"/ge;

print "1:\$txt=" . $tx1;
print "\n";
printf "Dumping \$tx1=";
dd( $tx1 );
print "\n";

$tx1 = $txt;
$TestCh1 = "\\xF4\\x80\\x81\\x89";
$TestCh2 = "\\N{U+2190}";

($s_ = '"'.($TestCh2).'"') =~ s/&/\$&/g;
print "2:\$s_=" . $s_ . "!, \$TestCh1=" . $TestCh1 . "!, \$TestCh2=" . $TestCh2 . "!\n";

$tx1 =~ s/$TestCh1/eval $s_/ge;
print "2:\$tx1=" . $tx1 . "!\n";
print "\n";
printf "Dumping \$tx1=";
dd( $tx1 );
print "\n";

$tx1 = $txt;
$TestCh2 = "\\xE2\\x86\\x90";
($s_ = '"'.($TestCh2).'"') =~ s/&/\$&/g;
print "3:\$s_=" . $s_ . "!, \$TestCh1=" . $TestCh1 . "!, \$TestCh2=" . $TestCh2 . "!\n";

$tx1 =~ s/$TestCh1/eval $s_/ge;
print "3:\$tx1=" . $tx1 . "!";
print "\n";
printf "Dumping \$tx1=";
dd( $tx1 );
print "\n";

__END__
Dumping $txt="This =>\x{100049}<= is a Unicode character in Plane 16"
0:$txt=This =>􀁉<= is a Unicode character in Plane 16
Dumping $tx1="This =>\x{100049}<= is a Unicode character in Plane 16"

1:$txt=This =>􀁉<= is a Unicode character in Plane 16
Dumping $tx1="This =>\x{100049}<= is a Unicode character in Plane 16"

2:$s_="\N{U+2190}"!, $TestCh1=\xF4\x80\x81\x89!, $TestCh2=\N{U+2190}!
2:$tx1=This =>􀁉<= is a Unicode character in Plane 16!

Dumping $tx1="This =>\x{100049}<= is a Unicode character in Plane 16"

3:$s_="\xE2\x86\x90"!, $TestCh1=\xF4\x80\x81\x89!, $TestCh2=\xE2\x86\x90!
3:$tx1=This =>􀁉<= is a Unicode character in Plane 16!
Dumping $tx1="This =>\x{100049}<= is a Unicode character in Plane 16"

Your use of curly braces as quoting characters in your example is a little bit confusing to me. Won't simple slashes do just as well in your example? Or is there some other subtle reason to use the braces instead?

What's confusing to me in the a2p translation of the gsub function is that, in the substitution expression, the replacement text variable $s_ is preceded by an "eval" and the suffix modifier "e" is also used, which is supposed to do an "eval" on the replacement expression. Does that mean that the variable $s_ is "eval"ed twice? And if that is true, why is it done that way?

Thanks again for your help.

Peter
Re^3: Global substitution of non-base-plane Unicode characters by Anonymous Monk on Feb 24, 2014 at 00:59 UTC
"Yes, I am sure that I am not getting the translation in my code since ..."

Well, compare your first substitution with mine: yours has \\N{...}, which is not the same as \N{...} (a \ escapes a \, so \\ means a literal \).

Your second substitution maybe looks broken because PerlMonks only supports the Latin-1 charset. s/&/\$&/g; is that what you had?

Your third substitution attempts to replace the raw UTF-8 byte representation of that Unicode code point, which might work if you were dealing with bytes/octets in your string, but you're dealing with Unicode code points, not their byte representations (see perlunitut: Unicode in Perl). Probably a2p eval eval eval stuff.

Basically, there is way too much eval/interpolation/indirection going on, too much overlap to reason about, and too much stuff to compare visually. Notice that my example is very short, deals with one small string, and naturally works; copy my example :)

So my suggestion: stop doing too many operations in one statement (assign, concatenate, and substitute; forget the eval). Deal with only one string, one operation, one ddumper, ... final operation (substitution), final ddumper, end of program. Do a ddumper after every operation so you can notice the changes in the byte representations :)

"Your use of curly braces as quoting characters in your example is a little bit confusing to me. Won't simple slashes do just as well in your example? Or is there some other subtle reason to use the braces instead?"

There is no subtle reason other than habit. I often start by typing my examples on the command line (cmd.exe), so I avoid ' and " because of that, I avoid `qq[]` because of PerlMonks (that's how you link on PerlMonks), and I avoid // so it doesn't look like m/// or s///. I really don't think about it much, and I switch between all of these over the hours of the day, but {} is used for a lot of things in a lot of programming languages, so force of habit forces {}. Speaking of habits, () is used even more, but it always kinda irks me, being up there above {} :D

"Does that mean that the variable $s_ is "eval"ed twice?"

No, it's evaled "once" (for each match).

s{regex}{string} means replace what the regex matched with string.
s{regex}{code}e means replace what the regex matched with the result of code; the e in s///e tells the substitution operator that the replacement part is code and not a string.
Two e's, as in s{regex}{code}ee, mean treat the replacement as code, and then treat the return value of that code as code.
s//code/ee means s//eval code/e.

"And if that is true, why is it done that way?"

It's an implementation detail, the simplest way to cover all possibilities. a2p is a C program that parses awk programs and prints out Perl programs; it's computer-generated code (using the byacc compiler-compiler).

Like I said earlier, it's not a great way to learn Perl: great as a foothold for switching over from awk, but not a substitute for perlintro.
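The difference between no modifier, /e, and /ee is easier to see on a tiny string than on the a2p output. The following is a minimal sketch with made-up variable names (`$source`, `$expr` and so on are assumptions for illustration). The last three lines mimic the a2p-style `eval $s_` replacement in isolation, which suggests that the replacement side of the generated code is sound and that the non-matching pattern is what prevents the substitution.

```perl
use strict;
use warnings;

my $source = 'a-b';
my $expr   = '1+1';    # a string that happens to contain Perl code

# No /e: the replacement is an ordinary interpolating string.
(my $plain = $source) =~ s/-/1+1/;          # a1+1b

# /e: the replacement is Perl code, run once for each match.
(my $once = $source) =~ s/-/1+1/e;          # a2b

# /e with a variable: the code is just "$expr", so its text is inserted.
(my $var = $source) =~ s/-/$expr/e;         # a1+1b

# /ee, or /e plus an explicit eval: the text in $expr is itself run as code.
(my $twice = $source) =~ s/-/$expr/ee;      # a2b
(my $eval  = $source) =~ s/-/eval $expr/e;  # a2b

print "$plain $once $var $twice $eval\n";   # a1+1b a2b a1+1b a2b a2b

# a2p-style replacement: $s_ holds the source text "\N{U+2190}", quotes
# included, and eval turns that text into the arrow character.
my $s_ = '"\N{U+2190}"';
(my $arrow = 'x') =~ s/x/eval $s_/ge;
printf "U+%04X\n", ord $arrow;              # U+2190
```

When the replacement is already known as a character, as in this thread, none of this machinery is needed: a plain `s/\N{U+100049}/\N{U+2190}/g` does the job.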
Re^4: Global substitution of non-base-plane Unicode characters by pjfarley3 (Initiate) on Feb 24, 2014 at 02:53 UTC
Re^3: Global substitution of non-base-plane Unicode characters by Anonymous Monk on Feb 24, 2014 at 01:02 UTC
Don't use printf as a substitute for print; the first (string) argument to printf is a format template, so any % in it is interpreted as the start of a conversion.
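A short sketch of the safe forms, assuming a string that may contain a percent sign (the `$label` value is invented for this illustration):

```perl
use strict;
use warnings;

my $label = "100% done";

# print emits the string exactly as it is.
print $label, "\n";

# With printf, keep the template fixed and pass the data as an argument.
printf "%s\n", $label;

# Risky: here "% d" would be parsed as a conversion with no argument.
# printf $label;

# sprintf follows the same rule when building strings.
my $line = sprintf "%s (%d%%)", "progress", 100;
print $line, "\n";    # progress (100%)
```

The strings in the thread contain no % characters, so the original printf calls happen to work; using print for fixed text removes the dependency on that.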
Re^4: Global substitution of non-base-plane Unicode characters by Jim (Curate) on Feb 24, 2014 at 04:00 UTC
Re^5: Global substitution of non-base-plane Unicode characters by kcott (Archbishop) on Feb 24, 2014 at 04:29 UTC
Re^4: Global substitution of non-base-plane Unicode characters by pjfarley3 (Initiate) on Feb 24, 2014 at 02:56 UTC
Re^5: Global substitution of non-base-plane Unicode characters by kcott (Archbishop) on Feb 24, 2014 at 03:34 UTC