Re^2: The Queensrÿche Situation

Yes! That fixes the printing problem in my terminal. And this makes complete sense now. Thank you for clearing this up!

One problem remains that I still don't quite understand.


    #!/usr/bin/perl

    use strict;
    use Encode;
    use Text::Unaccent::PurePerl;

    binmode STDOUT, ":utf8";

    # Only the literal below is read as UTF-8 source text.
    use utf8;
    my $string = "Queensrÿche";
    no utf8;

    chars($string);

    (Encode::is_utf8($string)) ? print "this is utf8\n" : print "this is NOT utf8\n";

    print "$string\n";

    print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";

    exit;

    # Print each character's decimal ordinal, the character, and its escaped form.
    sub chars {
        my $k = shift;
        my @chars = split("", $k);

        foreach (@chars) {
            my $dec = ord($_);
            my $chr = chr(ord($_));
            my $q   = qquote($_);

            print "\t$dec\t$chr\t$q\n";
        }
    }

    # Escape quote-like characters, and non-ASCII characters when the string
    # holds more bytes than characters.
    sub qquote {
        local ($_) = shift;

        s/([\\\"\@\$])/\\$1/g;

        my $bytes; { use bytes; $bytes = length }

        s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;

        return $_;
    }

Why does that produce this:


        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        255     ÿ       \x{ff}
        99      c       c
        104     h       h
        101     e       e
this is utf8
Queensrÿche
unaccented: Queensryche

Is that actually valid utf-8? Shouldn't the ÿ be two bytes (decimal 195 191)? Like this:

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        195     -       \x{c3}
        191     -       \x{bf}
        99      c       c
        104     h       h
        101     e       e

Re^3: The Queensrÿche Situation by aitap (Curate) on Oct 19, 2014 at 20:01 UTC
When you work with Unicode, you should get character code points (which can be greater than 255), not byte sequences, because Perl encapsulates encodings for you. For example:

    use utf8;
    binmode STDOUT, ":utf8";
    my $string = "Queensrÿche ы";
    printf "%x\t%s\n", ord($_), $_ for split "", $string;
    __END__
    51      Q
    75      u
    65      e
    65      e
    6e      n
    73      s
    72      r
    ff      ÿ
    63      c
    68      h
    65      e
    20
    44b     ы

If you need to work with utf-8 bytes, encode them back:

    use utf8;
    use Encode 'encode';
    binmode STDOUT, ":utf8";
    my $string = "Queensrÿche ы";
    printf "%x\t%s\n", ord($_), $_ for split "", encode utf8 => $string;
    __END__
    51      Q
    75      u
    65      e
    65      e
    6e      n
    73      s
    72      r
    c3      Ã
    bf      ¿
    63      c
    68      h
    65      e
    20
    d1      Ñ
    8b

But there would be no point in using utf8 and Encode in this case.
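The character/byte distinction described above can also be checked with `length`. This is only a minimal sketch: `ordinals_hex` is a hypothetical helper, not part of Encode, and the string is written with an `\x{ff}` escape so the example does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Character string: the y-with-diaeresis is one character, code point 0xFF.
my $chars = "Queensr\x{ff}che";

# Byte string: the same text encoded as UTF-8, where that character
# becomes the two bytes 0xC3 0xBF.
my $bytes = encode('UTF-8', $chars);

# Hypothetical helper: list the ordinals of a string as lowercase hex.
sub ordinals_hex {
    my ($s) = @_;
    return join ' ', map { sprintf '%x', ord($_) } split //, $s;
}

printf "characters: %d (%s)\n", length($chars), ordinals_hex($chars);
printf "bytes:      %d (%s)\n", length($bytes), ordinals_hex($bytes);

# Decoding the bytes gives back the original character string.
printf "round trip ok: %s\n", decode('UTF-8', $bytes) eq $chars ? 'yes' : 'no';
```

Both strings hold the same text; they differ only in whether each element is a character or a byte of its UTF-8 encoding.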
Re^4: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 21:00 UTC
Ok, this is all falling into place for me now. Thank you.
Re^3: The Queensrÿche Situation by karlgoethebier (Abbot) on Oct 19, 2014 at 21:02 UTC
"Yes! That fixes the printing problem in my terminal!" That's nice. But just to add a little confusion, please see this. A one-liner prints it as expected:

    karl$ perl -e 'print qq(Queensrÿche\n)'
    Queensrÿche

But see what happens when I put the same thing into a script (in the same terminal session):

    #!/usr/bin/env perl
    use strict;
    use warnings;
    binmode STDOUT, ":utf8";
    my $string = qq(Queensrÿche);
    print qq($string\n);
    my $y_with_trema = qq(\N{LATIN SMALL LETTER Y WITH DIAERESIS});
    print qq($y_with_trema\n);
    $string = qq(Queensr) . $y_with_trema . qq(che);
    print qq($string\n);
    __END__
    karls-mac-mini:monks karl$ ./roadster001.pl
    QueensrÃ¿che
    ÿ
    Queensrÿche

Seems like things are getting weird. I wonder when I will ever understand this crap.

N.B.: I came in a bit late and haven't read all the posts yet.

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»
Re^4: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 21:38 UTC
I can actually answer this now :) Take a look below; hopefully that will clear it up.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Encode;
    binmode STDOUT, ":utf8";

    my $string = qq(Queensrÿche);
    print qq($string\n);
    Encode::is_utf8($string) ? print " - is utf8\n" : print " - is not utf8\n";

    use utf8;
    $string = qq(Queensrÿche);
    no utf8;
    print qq($string\n);
    Encode::is_utf8($string) ? print " - is utf8\n" : print " - is not utf8\n";

Output:

    QueensrÃ¿che
     - is not utf8
    Queensrÿche
     - is utf8
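The "is not utf8" case, and how it differs from printing with no output layer at all, can be imitated without a terminal. This is a rough sketch under two assumptions: the `:utf8` layer is simulated with `Encode::encode`, and `hex_bytes` is a hypothetical helper introduced only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Without "use utf8", a y-with-diaeresis typed into UTF-8 source reaches
# Perl as two separate bytes, not as one character.
my $literal = "\xc3\xbf";

# No output layer: the two bytes are written unchanged, so a UTF-8
# terminal reassembles them into the intended character.
my $no_layer = $literal;

# With a UTF-8 output layer (simulated here by encode): each byte is
# taken as a character in the 0..255 range and encoded again, which
# yields four bytes that a terminal shows as two wrong characters.
my $with_layer = encode('UTF-8', $literal);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $s;
}

printf "no layer:   %s\n", hex_bytes($no_layer);
printf "with layer: %s\n", hex_bytes($with_layer);
```

In this reading, the layer is not at fault; it is handed bytes where it expects characters, so the text ends up encoded twice.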
Re^5: The Queensrÿche Situation by karlgoethebier (Abbot) on Oct 21, 2014 at 07:56 UTC
"...hopefully that will clear it up." Thank you, Rodster001, for your reply. Unfortunately, this doesn't explain why the one-liner prints correctly. But perhaps I have yet another mental block ;-)

Best regards, Karl

«The Crux of the Biscuit is the Apostrophe»
Re^3: The Queensrÿche Situation by Rodster001 (Pilgrim) on Oct 19, 2014 at 20:00 UTC
I figured it out, sort of. The first is actually extended ASCII, i.e. Latin-1 (255 maps to "ÿ"): http://www.ascii-code.com

So, when I take the string "Queensrÿche" (which IS actually encoded as utf-8), for example:

    Decimal Char    escaped
    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    195     -       \x{c3}
    191     -       \x{bf}
    99      c       c
    104     h       h
    101     e       e

It is now printing on my terminal like this:

    QueensrÃ¿che

This makes sense, in a way, now, because 195 maps to "Ã" and 191 maps to "¿".

So now my question is: why isn't this mapping using a utf-8 table (instead of ascii)? Encode thinks the string is utf-8 (which I assume means the utf-8 flag is on).
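One way to look at the 195/191 versus 255 question is to ask Perl for the Unicode name of each ordinal before and after decoding. This is a small sketch, assuming the input arrives as raw UTF-8 bytes; `describe` is a hypothetical helper, while `charnames::viacode` and `Encode::decode` are the standard functions.

```perl
use strict;
use warnings;
use charnames ();
use Encode qw(decode);

# Raw UTF-8 bytes, as they would arrive from a file or terminal
# that has not been decoded yet (12 bytes).
my $bytes = "Queensr\xc3\xbfche";

# Decoded character string (11 characters).
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: "ordinal<TAB>name" for every non-ASCII element.
sub describe {
    my ($s) = @_;
    return map  { sprintf "%d\t%s", $_, charnames::viacode($_) }
           grep { $_ > 127 }
           map  { ord($_) } split //, $s;
}

print "before decode:\n";
print "  $_\n" for describe($bytes);

print "after decode:\n";
print "  $_\n" for describe($chars);
```

Before decoding, each byte is treated as its own character, which is where "Ã" and "¿" come from; after decoding, the pair collapses into the single code point 255.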