Re^2: german Alphabet

Re^3: german Alphabet by haukex (Archbishop) on Dec 04, 2018 at 21:02 UTC
> I don't see in ikegami's script the need for `use utf8;`.

The OP as well as ikegami's script contain the string `'Fräsen und ndk (Kamera - Fräsaufnahme)'`. From utf8: "The use utf8 pragma tells the Perl parser to allow UTF-8 in the program text in the current lexical scope. ... Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. ... Because it is not possible to reliably tell UTF-8 from native 8 bit encodings, you need either a Byte Order Mark at the beginning of your source code, or `use utf8;`, to instruct perl."

Although the "ä" may ~~happen~~ appear to work because it's part of the Latin1 character set, ~~which Perl typically uses internally~~, it will most likely not do what you want on any Unicode characters outside of that set. As you can see below, the only version of the code in which the `UTF8` flag is properly set on the string is the one where the source is encoded as UTF-8 and `use utf8;` is used.

The rule of thumb I always use is to either work entirely in ASCII (using escapes such as `\N{}` to specify Unicode characters), or otherwise use a UTF-8 encoding on the source code and `use utf8;`. See also perluniintro and perlunicode.

    $ cat with_utf8.pl
    use warnings;
    use strict;
    use utf8;
    use Devel::Peek;
    my $string = 'Fräsen und ndk (Kamera - Fräsaufnahme)';
    Dump($string);
    $ perl -pe 's/^(?=.*utf8)/#/' with_utf8.pl | tee without_utf8.pl
    use warnings;
    use strict;
    #use utf8;
    use Devel::Peek;
    my $string = 'Fräsen und ndk (Kamera - Fräsaufnahme)';
    Dump($string);
    $ iconv -f UTF-8 -t Latin1 without_utf8.pl -o latin1.pl
    $ file -i *.pl
    latin1.pl:       text/plain; charset=iso-8859-1
    without_utf8.pl: text/plain; charset=utf-8
    with_utf8.pl:    text/plain; charset=utf-8
    $ perl latin1.pl
    SV = PV(0x1365d70) at 0x13855c0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x13d7160 "Fr\344sen und ndk (Kamera - Fr\344saufnahme)"\0
      CUR = 38
      LEN = 40
      COW_REFCNT = 1
    $ perl without_utf8.pl
    SV = PV(0xa15d70) at 0xa355c0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0xa87190 "Fr\303\244sen und ndk (Kamera - Fr\303\244saufnahme)"\0
      CUR = 40
      LEN = 42
      COW_REFCNT = 1
    $ perl with_utf8.pl
    SV = PV(0x18d5d70) at 0x18f55d8
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x19384a0 "Fr\303\244sen und ndk (Kamera - Fr\303\244saufnahme)"\0 [UTF8 "Fr\x{e4}sen und ndk (Kamera - Fr\x{e4}saufnahme)"]
      CUR = 40
      LEN = 42
      COW_REFCNT = 1

Updated as per ikegami's reply.
Re^4: german Alphabet by ikegami (Patriarch) on Dec 05, 2018 at 09:19 UTC
Perl assumes ASCII, not latin-1.

    $ perl -Mutf8 -MEncode -e'print encode("latin-1", "sub fête {}\n");' \
       | perl
    Illegal declaration of subroutine main::f at - line 1.

If you happen to use an 8-bit byte in a string literal, a character with the value of the byte will be created rather than an error being thrown.
Re^5: german Alphabet by Anonymous Monk on Dec 15, 2018 at 19:51 UTC
It might be important to note that when one tries to print a wide string that happens to be representable in latin-1, Perl uses latin-1 with no warnings:

    $ perl -w -Mutf8 -E'print "ê"' | hd
    00000000  ea                                                |.|
    00000001

`"ê"` is decoded into characters but then printed to a handle that doesn't have an `:encoding(...)` or `:utf8` I/O layer. Since it's representable in latin-1, the single-byte encoding is used and no warning is shown.

    $ perl -w -Mutf8 -E'print "ы"' | hd
    Wide character in print at -e line 1.
    00000000  d1 8b                                             |..|
    00000002

Similar situation, but `"ы"` cannot be represented in latin-1, so we get a warning and UTF-8 bytes instead.

    $ perl -w -E'print "ê"' | hd
    00000000  c3 aa                                             |..|
    00000002

(My terminal is UTF-8. No decoding or encoding is done in this case; Perl operates on bytes.)
Re^6: german Alphabet by ikegami (Patriarch) on Dec 16, 2018 at 19:53 UTC
Re^7: german Alphabet by Anonymous Monk on Dec 16, 2018 at 21:47 UTC
Re^4: german Alphabet by Aldebaran (Curate) on Dec 07, 2018 at 23:44 UTC
> ...use a UTF-8 encoding on the source code and use utf8;

I have found this to be a very informative thread and ikegami's comments illuminating. Some issues require comment with attending source, so that others can replicate. I have enjoyed replicating haukex's source and wicked use of the command line to clone a script with a use statement commented out. That said, I don't understand the current output. I use my clone tool on haukex's script to get a filename in my nomenclature:

    $ ./2.create.bash with_utf8.pl
    The shebang is specifying bash
    Using bash 4.4.19(1)-release
    1
    1.pl
    -rwxr-xr-x 1 bob bob 125 Dec 6 11:54 1.pl
    $ file -i *.pl
    ...
    1.pl: text/x-perl; charset=utf-8
    2.excel.pl: text/x-perl; charset=us-ascii
    ...
    5.ping4.pl: text/x-perl; charset=us-ascii
    6.excel.pl: text/x-perl; charset=us-ascii
    latin1.pl: text/plain; charset=iso-8859-1
    without_utf8.pl: text/plain; charset=utf-8
    with_utf8.pl: text/plain; charset=utf-8
    $

I then use his nifty shell command:

    $ perl -pe 's/^(?=.*utf8)/#/' 1.pl | tee 1.without_utf8.pl
    #!/usr/bin/perl -w
    use 5.011;
    #use utf8;
    use Devel::Peek;
    my $string = 'Gödel';
    Dump($string);
    $string = 'über';
    Dump($string);
    $string = 'alleß';
    Dump($string);
    $

The original is unable to render the special characters in STDOUT.
Uncertain what happens in code tags:

    $ ./1.pl
    string is G�del
    SV = PV(0x556af89dcda0) at 0x556af8a06fa0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556af8a2d730 "G\303\266del"\0 [UTF8 "G\x{f6}del"]
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    string is �ber
    SV = PV(0x556af89dcda0) at 0x556af8a06fa0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556af8a124e0 "\303\274ber"\0 [UTF8 "\x{fc}ber"]
      CUR = 5
      LEN = 10
      COW_REFCNT = 1
    string is alle�
    SV = PV(0x556af89dcda0) at 0x556af8a06fa0
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x556af89f9eb0 "alle\303\237"\0 [UTF8 "alle\x{df}"]
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    $

BUT (this part is surprising to me), the umlauts are legible in STDOUT for the version with use utf8 commented out. They will probably get shredded in code tags:

    $ ./1.without_utf8.pl
    string is Gödel
    SV = PV(0x5584c04dbda0) at 0x5584c0505a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5584c052d400 "G\303\266del"\0
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    string is über
    SV = PV(0x5584c04dbda0) at 0x5584c0505a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5584c0515040 "\303\274ber"\0
      CUR = 5
      LEN = 10
      COW_REFCNT = 1
    string is alleß
    SV = PV(0x5584c04dbda0) at 0x5584c0505a88
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK)
      PV = 0x5584c050d0d0 "alle\303\237"\0
      CUR = 6
      LEN = 10
      COW_REFCNT = 1
    $

I tried to switch the encoding to us-ascii using a command similar to what you used but fail to find the correct syntax:

    $ iconv -f UTF-8 -t us-ascii 1.pl -o 1.us-ascii.pl
    iconv: illegal input sequence at position 72
    $

Also, I'm not sure what I'm to be gleaning from Devel::Peek. Is the idea that you get to see what perl's internal representation of a string is?
Re^5: german Alphabet by ikegami (Patriarch) on Dec 08, 2018 at 06:14 UTC
You get to see Perl's internal representations of scalars and their "subclasses" (arrays, hashes, globs, etc). See illguts for documentation on these. (Grab the tarball and look at the files named `index-*.html` or `illguts-*.pdf`.)

The transcoding failure is the result of "ö", "ü" and "ß" not being in the US-ASCII character set.
Re^5: german Alphabet by Aldebaran (Curate) on Dec 12, 2018 at 00:27 UTC
> I tried to switch the encoding to us-ascii using a command similar to what you used but fail to find the correct syntax:
>
>     $ iconv -f UTF-8 -t us-ascii 1.pl -o 1.us-ascii.pl
>     iconv: illegal input sequence at position 72
>     $

I finally realized what that was when I took a look inside the unexpected partial file that resulted from this command: 1.us-ascii.pl:

    $ cat 1.ascii.pl
    #!/usr/bin/perl -w
    use 5.011;
    #use utf8;
    use Devel::Peek;
    my $string = 'G$

So, position 72 was where the first umlaut occurred, and now I at least understand the error.