in reply to Re: Global substitution of non-base-plane Unicode characters
in thread Global substitution of non-base-plane Unicode characters

Thanks for the reference to Data::Dumper, very useful.

Yes, I am sure that I am not getting the translation in my code since I did follow the "Basic debugging checklist" item 2, using print and printf to dump the results of the substitutions and the intermediate variable $s_ after each attempt. Using your suggested Data::Dumper code I also confirmed the print* results. Here is my code updated with your Data::Dumper suggestion and a few additional printf's, followed by the output results:

use strict;
use warnings;
use utf8;
use feature 'unicode_strings';
use Data::Dump qw/ dd /;

my $txt;
my $tx1;
my $s_;
my $TestCh1;
my $TestCh2;

binmode STDOUT, ':encoding(UTF-8)';
printf "\x{FEFF}";
# $txt = "This =>\N{U+100049}<= is a Unicode character in Plane 16";
$txt = "This =>&#1048649;<= is a Unicode character in Plane 16";
printf "Dumping \$txt="; dd( $txt );

$tx1 = $txt;
$tx1 =~ s/"\\N{U+100049}"/"\N{U+2190}"/ge;
print "0:\$txt=" . $tx1; print "\n";
printf "Dumping \$tx1="; dd( $tx1 ); print "\n";

$tx1 = $txt;
$tx1 =~ s/\\xF4\\x80\\x81\\x89/"\\N{U+2190}"/ge;
print "1:\$txt=" . $tx1; print "\n";
printf "Dumping \$tx1="; dd( $tx1 ); print "\n";

$tx1 = $txt;
$TestCh1 = "\\xF4\\x80\\x81\\x89";
$TestCh2 = "\\N{U+2190}";
($s_ = '"'.($TestCh2).'"') =~ s/&/\$&/g;
print "2:\$s_=" . $s_ . "!, \$TestCh1=" . $TestCh1 . "!, \$TestCh2=" . $TestCh2 . "!\n";
$tx1 =~ s/$TestCh1/eval $s_/ge;
print "2:\$tx1=" . $tx1 . "!\n"; print "\n";
printf "Dumping \$tx1="; dd( $tx1 ); print "\n";

$tx1 = $txt;
$TestCh2 = "\\xE2\\x86\\x90";
($s_ = '"'.($TestCh2).'"') =~ s/&/\$&/g;
print "3:\$s_=" . $s_ . "!, \$TestCh1=" . $TestCh1 . "!, \$TestCh2=" . $TestCh2 . "!\n";
$tx1 =~ s/$TestCh1/eval $s_/ge;
print "3:\$tx1=" . $tx1 . "!"; print "\n";
printf "Dumping \$tx1="; dd( $tx1 ); print "\n";

__END__
Dumping $txt="This =>\x{100049}<= is a Unicode character in Plane 16"
0:$txt=This =>&#1048649;<= is a Unicode character in Plane 16
Dumping $tx1="This =>\x{100049}<= is a Unicode character in Plane 16"
1:$txt=This =>&#1048649;<= is a Unicode character in Plane 16
Dumping $tx1="This =>\x{100049}<= is a Unicode character in Plane 16"
2:$s_="\N{U+2190}"!, $TestCh1=\xF4\x80\x81\x89!, $TestCh2=\N{U+2190}!
2:$tx1=This =>&#1048649;<= is a Unicode character in Plane 16!
Dumping $tx1="This =>\x{100049}<= is a Unicode character in Plane 16"
3:$s_="\xE2\x86\x90"!, $TestCh1=\xF4\x80\x81\x89!, $TestCh2=\xE2\x86\x90!
3:$tx1=This =>&#1048649;<= is a Unicode character in Plane 16!
Dumping $tx1="This =>\x{100049}<= is a Unicode character in Plane 16"

Your use of curly braces as quoting characters in your example is a little bit confusing to me. Won't simple slashes do just as well in your example? Or is there some other subtle reason to use the braces instead?

What's confusing to me in the a2p translation of the gsub function is that in the substitution expression the replacement text variable $s_ is preceded by an "eval" and the suffix modifier "e" is also used which is supposed to do an "eval" on the replacement expression. Does that mean that the variable $s_ is "eval"ed twice? And if that is true, why is it done that way?

Thanks again for your help.

Peter

Re^3: Global substitution of non-base-plane Unicode characters
by Anonymous Monk on Feb 24, 2014 at 00:59 UTC

    Yes, I am sure that I am not getting the translation in my code since

    Well, compare your first substitution with mine: yours has \\N{...}, which is not the same as \N (a \ escapes a \, so \\ means a literal \)
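
    To make the difference concrete, here is a minimal sketch (not the code from either post). The sample string and the variable names $doubled and $single are made up for illustration, and the braces and plus sign in the first pattern are escaped so that it really does look for the literal text \N{U+100049}:

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample string; \N{U+100049} in a double-quoted string is
    # the single Plane 16 character U+100049.
    my $txt = "This =>\N{U+100049}<= is a Unicode character in Plane 16";

    # Doubled backslash: the regex engine looks for a literal backslash
    # followed by the text N{U+100049}, which never occurs in $txt.
    (my $doubled = $txt) =~ s/\\N\{U\+100049\}/\N{U+2190}/g;

    # Single backslash: \N{U+100049} stands for the character itself.
    (my $single = $txt) =~ s/\N{U+100049}/\N{U+2190}/g;

    printf "doubled changed: %s\n", $doubled eq $txt ? "no" : "yes";
    printf "single changed:  %s\n", $single  eq $txt ? "no" : "yes";
    ```

    Only the second substitution changes the string.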

    your second substitution may only look broken because perlmonks supports just the latin-1 charset
    s/&/\$&/g; is that what you had?

    your third substitution attempts to replace the raw UTF-8 byte representation of that Unicode code point .. which might work if you were dealing with bytes/octets in your string, but you're dealing with Unicode code points, not their byte representations ( perlunitut, perlunitut: Unicode in Perl ) .. probably a result of the a2p eval eval eval stuff
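
    A small sketch of the code point versus octet distinction, assuming the core Encode module; the variable names are illustrative only:

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);

    my $chars = "\N{U+100049}";            # one character (code point)
    my $bytes = encode('UTF-8', $chars);   # its UTF-8 encoding as octets

    printf "characters: %d, octets: %d\n", length $chars, length $bytes;
    printf "octets: %s\n", join ' ', map { sprintf '%02X', ord } split //, $bytes;

    # A pattern written in terms of the octets matches the encoded string,
    # not the character string.
    my $byte_pat = qr/\xF4\x80\x81\x89/;
    printf "matches character string: %s\n", $chars =~ $byte_pat ? "yes" : "no";
    printf "matches encoded octets:   %s\n", $bytes =~ $byte_pat ? "yes" : "no";
    ```

    The character string has length 1 and the byte pattern does not match it; the encoded string has length 4 and does match.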

    Basically, way too much eval/interpolation/eval/indirection going on, too much overlap to reason about, too much stuff to visually compare

    notice my example, very short, deals with one small string , and naturally works ;; copy my example :)

    so my suggestion, stop doing too many operations in one (assign, concatenate , and substitute .. forget an eval )

    Only deal with one string, one operation, one DDumper, ... final operation (substitution) final DDumper, end of program -- do a ddumper after every operation so you can notice the changes in the byte representations

    :)
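
    The "one string, one operation, one dump" routine might look like the sketch below. It uses the core Data::Dumper module with Useqq as a stand-in for dd from Data::Dump (an assumption made so the example needs no CPAN install); the string is a shortened, made-up version of the test text:

    ```perl
    use strict;
    use warnings;
    use Data::Dumper;

    # Show escapes instead of raw characters, on a single line.
    $Data::Dumper::Useqq  = 1;
    $Data::Dumper::Terse  = 1;
    $Data::Dumper::Indent = 0;

    my $str = "=>\N{U+100049}<=";

    # Dump before the one operation under test ...
    my $before = Dumper($str);
    print "before: $before\n";

    # ... perform exactly one operation ...
    $str =~ s/\N{U+100049}/\N{U+2190}/g;

    # ... and dump again to see what changed.
    my $after = Dumper($str);
    print "after:  $after\n";
    ```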

    Your use of curly braces as quoting characters in your example is a little bit confusing to me. Won't simple slashes do just as well in your example? Or is there some other subtle reason to use the braces instead?

    There is no subtle reason other than habit -- I often start by typing my examples on the commandline ( cmd.exe ) so I avoid ' and " because of that -- and I avoid  qq[] because of perlmonks (thats how you link on perlmonks ), and I avoid // so it doesn't look like m/// or s/// ... I really don't think about it much and I switch between all these between the hours of the day ... but {} is used for a lot of things in a lot of programming languages, so force of habit forces {} ... speaking of habits () is used even more but it always kinda irks me being up there above {} :D

    Does that mean that the variable $s_ is "eval"ed twice?

    No, it's evaled "once" (for each match)
    s{regex}{string} means replace what the regex matched with string
    s{regex}{code}e means replace what the regex matched with the result of code; the e in s///e tells the s///ubstitution operator that the s//STUFFHERE/e is code and not a string
    s//STUFFHERE/e means treat STUFFHERE as code
    two e's in s//STUFFHERE/ee means treat STUFFHERE as code, and treat the return value of that code as code
    s//code/ee means s//eval code/e
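
    The four cases can be compared side by side in a short sketch. The variable $code and its contents are invented for illustration; this is not the a2p output itself:

    ```perl
    use strict;
    use warnings;

    my $code = '2 * 21';   # a string that happens to contain Perl code

    # No /e: the replacement is an interpolated string.
    (my $plain = 'x') =~ s/x/$code/;

    # /e: the replacement is Perl code; the expression $code yields its string value.
    (my $once = 'x') =~ s/x/$code/e;

    # /ee: the value returned by that code is evaluated again as Perl code.
    (my $twice = 'x') =~ s/x/$code/ee;

    # An explicit eval under a single /e has the same effect as /ee.
    (my $explicit = 'x') =~ s/x/eval $code/e;

    print "plain=$plain once=$once twice=$twice explicit=$explicit\n";
    ```

    Here $plain and $once both end up as the text 2 * 21, while $twice and $explicit end up as 42.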

    And if that is true, why is it done that way?

    It's an implementation detail -- the simplest way to cover all possibilities; a2p is a C program that parses awk programs and prints out Perl programs; it's computer-generated code (using the byacc compiler-compiler)

    Like I said earlier, not a great way to learn perl ... great as a foothold for switchover from awk, but not a substitute for perlintro

      OK, thanks for all your explanation and advice -- I will try to keep it simple and ignore the a2p sophistication.

      Regards,

      Peter

Re^3: Global substitution of non-base-plane Unicode characters
by Anonymous Monk on Feb 24, 2014 at 01:02 UTC
    Don't use printf as a substitute for print, the first (string) argument to printf is a template
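
    A short sketch of why that matters. It uses sprintf, which follows the same formatting rules as printf, so the result can be inspected; $msg is a made-up piece of data:

    ```perl
    use strict;
    use warnings;

    my $msg = 'literal %s here';   # hypothetical data containing a format directive

    # Passing the data as an argument to a fixed template keeps it intact.
    my $safe = sprintf '%s', $msg;

    # Using the data itself as the template lets %s consume a (missing) argument.
    # Warnings are silenced here only to keep the demonstration quiet.
    my $risky = do { no warnings; sprintf $msg };

    print "safe:  [$safe]\n";
    print "risky: [$risky]\n";
    ```

    The first result is identical to $msg; the second is not, because the %s inside the data was interpreted as a directive.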

      In this case, using printf instead of print is justified and, in fact, smart. The default value of the predefined variable $\ ($OUTPUT_RECORD_SEPARATOR) is undef, which is what Peter wants and expects here. But a very surprising and potentially elusive bug can be introduced into Peter's program when the value of $\ is later changed. Using printf in this admittedly unusual way ensures that the Unicode byte order mark is never followed by any unexpected character such as newline (\n).

      Jim

        If "the value of $\ is later changed" was a genuine concern, a better way would be to explicitly code the following rather than expecting a subsequent maintainer to automatically realise why printf was used here:

        ... { local $\; print "\x{FEFF}"; } ...

        And, of course, a much better way to change $\ in the middle of the program, would be along these lines:

        ... code as it is now ...
        # later changes:
        ...
        {
            local $\ = "\n";
            ... code using changed $\ ...
        }
        ...

        -- Ken

      Thanks for reminding me of that. My purpose there was to avoid the automatic "\n" appended by print, so that the UTF-8 BOM is just the first thing written to the output.

      Peter

        "My purpose there was to avoid the automatic "\n" appended by print, ..."

        print does not do this. From its documentation:

        "The current value of $\ (if any) is printed after the entire LIST has been printed."

        $\ is the output record separator. From the "perlvar: Variables related to filehandles" documentation:

        "The output record separator for the print operator. If defined, this value is printed after the last of print's arguments. Default is undef."

        I can't see anywhere in the code you posted that you have explicitly defined $\ (e.g. $\ = "\n").

        Check your shebang line (not shown in any of your posted code) for a "-l" switch. This is probably the most likely cause of "the automatic "\n" appended by print". See "perlrun: Command Switches" for details of the "-l" switch.

        -- Ken
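
        The behaviour described in the quoted documentation can be checked with a small sketch that writes to an in-memory filehandle. The handle $fh and buffer $out are illustrative names, and setting $\ by hand stands in for what the -l switch would do:

        ```perl
        use strict;
        use warnings;

        # Write to an in-memory filehandle so the exact output can be inspected.
        my $out = '';
        open my $fh, '>', \$out or die "cannot open in-memory handle: $!";

        print {$fh} "A";          # $\ is undef by default: nothing is appended
        {
            local $\ = "\n";      # roughly what -l arranges for print
            print  {$fh} "B";     # print appends $\
            printf {$fh} "C";     # printf does not append $\
        }
        print {$fh} "D";          # $\ is undef again outside the block
        close $fh;

        (my $shown = $out) =~ s/\n/\\n/g;   # make the newline visible
        print "output: $shown\n";
        ```

        The only newline in the captured output comes from the print call made while $\ was set.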