in reply to Re^3: german Alphabet
in thread german Alphabet

...use a UTF-8 encoding on the source code and use utf8;

I have found this to be a very informative thread and ikegami's comments illuminating. Some issues call for comment with accompanying source, so that others can replicate them. I have enjoyed replicating haukex's source and his wicked use of the command line to clone a script with a use statement commented out. That said, I don't understand the current output.

I use my clone tool on haukex's script to get a filename in my nomenclature:

    $ ./2.create.bash with_utf8.pl
    The shebang is specifying bash
    Using bash 4.4.19(1)-release
    1 1.pl
    -rwxr-xr-x 1 bob bob 125 Dec 6 11:54 1.pl
    $ file -i *.pl
    ...
    1.pl: text/x-perl; charset=utf-8
    2.excel.pl: text/x-perl; charset=us-ascii
    ...
    5.ping4.pl: text/x-perl; charset=us-ascii
    6.excel.pl: text/x-perl; charset=us-ascii
    latin1.pl: text/plain; charset=iso-8859-1
    without_utf8.pl: text/plain; charset=utf-8
    with_utf8.pl: text/plain; charset=utf-8
    $

I then use his nifty shell command:

    $ perl -pe 's/^(?=.*utf8)/#/' 1.pl | tee 1.without_utf8.pl
    #!/usr/bin/perl -w
    use 5.011;
    #use utf8;
    use Devel::Peek;
    my $string = 'Gödel';
    Dump($string);
    $string = 'über';
    Dump($string);
    $string = 'alleß';
    Dump($string);
    $

The original is unable to render the special characters on STDOUT. I'm uncertain what will happen to them in code tags:

    $ ./1.pl
    string is G�del
    SV = PV(0x556af89dcda0) at 0x556af8a06fa0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556af8a2d730 "G\303\266del"\0 [UTF8 "G\x{f6}del"]
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    string is �ber
    SV = PV(0x556af89dcda0) at 0x556af8a06fa0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556af8a124e0 "\303\274ber"\0 [UTF8 "\x{fc}ber"]
      CUR = 5
      LEN = 10
      COW_REFCNT = 1
    string is alle�
    SV = PV(0x556af89dcda0) at 0x556af8a06fa0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556af89f9eb0 "alle\303\237"\0 [UTF8 "alle\x{df}"]
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    $

But (and this part surprises me) the umlauts are legible on STDOUT for the version with use utf8 commented out. They will probably get shredded in code tags:

    $ ./1.without_utf8.pl
    string is Gödel
    SV = PV(0x5584c04dbda0) at 0x5584c0505a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5584c052d400 "G\303\266del"\0
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    string is über
    SV = PV(0x5584c04dbda0) at 0x5584c0505a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5584c0515040 "\303\274ber"\0
      CUR = 5
      LEN = 10
      COW_REFCNT = 1
    string is alleß
    SV = PV(0x5584c04dbda0) at 0x5584c0505a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5584c050d0d0 "alle\303\237"\0
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    $
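For anyone replicating this, here is a minimal sketch of what is probably going on, assuming the terminal expects UTF-8 and STDOUT has no encoding layer. The variable names and the hex_bytes helper are made up for illustration, and decode() stands in for what use utf8 does to a literal. Without use utf8 the literal stays as raw UTF-8 bytes and passes through print untouched; with use utf8 it becomes characters, and a handle without a layer writes each character below U+0100 as a single byte, which is not valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The six bytes a UTF-8 encoded source file holds for 'Gödel'.
# Without "use utf8", the literal stays as these raw bytes.
my $without_utf8 = "G\x{C3}\x{B6}del";

# With "use utf8", perl decodes the literal into five characters.
# decode() stands in for that step here (an assumption for illustration).
my $with_utf8 = decode('UTF-8', $without_utf8);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# Bytes written by print when the handle has no encoding layer:
# byte strings pass through; characters below U+0100 become one byte each.
my $no_layer   = encode('ISO-8859-1', $with_utf8);
# Bytes written once the handle has an :encoding(UTF-8) layer.
my $with_layer = encode('UTF-8', $with_utf8);

printf "without use utf8:      %s\n", hex_bytes($without_utf8);  # 47 C3 B6 64 65 6C
printf "use utf8, no layer:    %s\n", hex_bytes($no_layer);      # 47 F6 64 65 6C
printf "use utf8, UTF-8 layer: %s\n", hex_bytes($with_layer);    # 47 C3 B6 64 65 6C
```

If that is the cause, adding binmode STDOUT, ':encoding(UTF-8)'; to the use utf8 version should make the umlauts legible again.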

I tried to switch the encoding to us-ascii using a command similar to what you used but failed to find the correct syntax:

    $ iconv -f UTF-8 -t us-ascii 1.pl -o 1.us-ascii.pl
    iconv: illegal input sequence at position 72
    $

Also, I'm not sure what I'm to be gleaning from Devel::Peek. Is the idea that you get to see what perl's internal representation of a string is?

Re^5: german Alphabet
by ikegami (Patriarch) on Dec 08, 2018 at 06:14 UTC

    You get to see Perl's internal representations of scalars and their "subclasses" (arrays, hashes, globs, etc). See illguts for documentation on these. (Grab the tarball and look at the files named index-*.html or illguts-*.pdf.)
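As a rough sketch of how the dump fields map onto things visible from Perl code: the UTF8 entry in FLAGS corresponds to utf8::is_utf8, and CUR is the number of bytes in the string buffer, which differs from length once the string holds decoded characters. The describe helper and variable names are hypothetical, and decode() is used here only to obtain a string with the UTF8 flag set.

```perl
use strict;
use warnings;
use bytes ();                 # load bytes::length without enabling byte semantics
use Devel::Peek qw(Dump);
use Encode qw(decode);

my $bytes = "G\x{C3}\x{B6}del";          # raw UTF-8 bytes, UTF8 flag off
my $chars = decode('UTF-8', $bytes);     # decoded characters, UTF8 flag on

# Hypothetical helper: report the values Devel::Peek shows as FLAGS/CUR.
sub describe {
    my ($label, $sv) = @_;
    return sprintf '%s: UTF8 flag %s, length %d, CUR %d',
        $label,
        (utf8::is_utf8($sv) ? 'on' : 'off'),
        length($sv),
        bytes::length($sv);
}

print describe('raw bytes', $bytes), "\n";   # raw bytes: UTF8 flag off, length 6, CUR 6
print describe('decoded  ', $chars), "\n";   # decoded  : UTF8 flag on, length 5, CUR 6

Dump($chars);   # full internal view goes to STDERR; addresses vary per run
```

Both strings occupy six bytes internally, which is why CUR = 6 appears in both dumps of 'Gödel' above even though only one of them carries the UTF8 flag.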

    The transcoding failure is the result of "ö", "ü" and "ß" not being in the US-ASCII character set.
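The same failure can be reproduced from Perl with the core Encode module. This is only a sketch with made-up variable names: with a strict CHECK value the encode dies on the first character that US-ASCII cannot represent, while the default behaviour substitutes a question mark for each one.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decoded text containing the three characters that US-ASCII lacks.
my $text = decode('UTF-8', "G\x{C3}\x{B6}del, \x{C3}\x{BC}ber, alle\x{C3}\x{9F}");

# Strict mode: die on the first unmappable character, leave $text untouched.
my $strict_ok = eval {
    encode('US-ASCII', $text, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print $strict_ok ? "strict encode succeeded\n" : "strict encode failed: $@";

# Default mode: unmappable characters are replaced by '?'.
my $lossy = encode('US-ASCII', $text);
print "$lossy\n";   # G?del, ?ber, alle?
```

Whether a lossy result like this is acceptable depends on the goal; for a script whose whole point is the umlauts, it is not.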

Re^5: german Alphabet
by Aldebaran (Curate) on Dec 12, 2018 at 00:27 UTC
    I tried to switch the encoding to us-ascii using a command similar to what you used but failed to find the correct syntax:

        $ iconv -f UTF-8 -t us-ascii 1.pl -o 1.us-ascii.pl
        iconv: illegal input sequence at position 72
        $

    I finally realized what that was when I took a look inside the unexpected partial file that this command left behind, 1.us-ascii.pl:

        $ cat 1.ascii.pl
        #!/usr/bin/perl -w
        use 5.011;
        #use utf8;
        use Devel::Peek;
        my $string = 'G$

    So, position 72 was where the first umlaut occurred, and now I at least understand the error.
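The position can be checked with a short sketch that looks for the first byte outside the ASCII range. It assumes 1.pl starts with the lines shown in the listing earlier in the thread, with use utf8; not commented out; the first_non_ascii_offset helper is hypothetical. The offset is zero-based and counts bytes, not characters.

```perl
use strict;
use warnings;

# Assumed start of 1.pl, written as raw bytes (the umlaut is C3 B6 in UTF-8).
my $source = "#!/usr/bin/perl -w\n"
           . "use 5.011;\n"
           . "use utf8;\n"
           . "use Devel::Peek;\n"
           . "my \$string = 'G\x{C3}\x{B6}del';\n";

# Hypothetical helper: zero-based byte offset of the first non-ASCII byte,
# or undef if the input is pure ASCII.
sub first_non_ascii_offset {
    my ($bytes) = @_;
    return $bytes =~ /[^\x00-\x7F]/ ? $-[0] : undef;
}

print first_non_ascii_offset($source), "\n";   # prints 72
```

To run it against a real file, read the file in raw mode (open with '<:raw') and pass its contents to the helper.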