Re^2: The Queensrÿche Situation

by Rodster001 (Pilgrim)
on Oct 19, 2014 at 19:24 UTC


in reply to Re: The Queensrÿche Situation
in thread The Queensrÿche Situation

Yes! That fixes the printing problem in my terminal. And this makes complete sense now. Thank you for clearing this up!

One problem remains that I still don't quite understand.

#!/usr/bin/perl
use strict;
use Encode;
use Text::Unaccent::PurePerl;

binmode STDOUT, ":utf8";

use utf8;
my $string = "Queensrÿche";
no utf8;

chars($string);
(Encode::is_utf8($string)) ? print "this is utf8\n" : print "this is NOT utf8\n";
print "$string\n";
print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
exit;

sub chars {
    my $k = shift;
    my @chars = split("",$k);
    foreach (@chars) {
        my $dec = ord($_);
        my $chr = chr(ord($_));
        my $q = qquote($_);
        print "\t$dec\t$chr\t$q\n";
    }
}

sub qquote {
    local($_) = shift;
    s/([\\\"\@\$])/\\$1/g;
    my $bytes;
    { use bytes; $bytes = length }
    s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
    return $_;
}

Why does that produce this:

    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    255     ÿ       \x{ff}
    99      c       c
    104     h       h
    101     e       e
this is utf8
Queensrÿche
unaccented: Queensryche

Is that actually valid utf-8? Shouldn't the ÿ be two bytes (decimal 195 191)? Like this:

    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    195     -       \x{c3}
    191     -       \x{bf}
    99      c       c
    104     h       h
    101     e       e

Replies are listed 'Best First'.
Re^3: The Queensrÿche Situation
by aitap (Curate) on Oct 19, 2014 at 20:01 UTC

    When you work with decoded Unicode strings, you should see character codes (code points, which can be greater than 255), not byte sequences, because Perl abstracts the encoding away for you. For example,

    use utf8;
    binmode STDOUT, ":utf8";
    my $string = "Queensrÿche ы";
    printf "%x\t%s\n", ord($_), $_ for split "", $string;
    __END__
    51      Q
    75      u
    65      e
    65      e
    6e      n
    73      s
    72      r
    ff      ÿ
    63      c
    68      h
    65      e
    20       
    44b     ы
    

    If you need to work with utf-8 bytes, encode them back:

    use utf8;
    use Encode 'encode';
    binmode STDOUT, ":utf8";
    my $string = "Queensrÿche ы";
    printf "%x\t%s\n", ord($_), $_ for split "", encode utf8 => $string;
    __END__
    51      Q
    75      u
    65      e
    65      e
    6e      n
    73      s
    72      r
    c3      Ã
    bf      ¿
    63      c
    68      h
    65      e
    20       
    d1      Ñ
    8b
    
    But there would be no point in using utf8 and Encode in this case.

      Ok, this is all falling into place for me now. Thank you.
Re^3: The Queensrÿche Situation
by karlgoethebier (Abbot) on Oct 19, 2014 at 21:02 UTC
    "Yes! That fixes the printing problem in my terminal!"

    That's nice. But just to add a little bit of confusion, please see this:

    A One-Liner prints it out as expected:

    karl$ perl -e 'print qq(Queensrÿche\n)'
    Queensrÿche

    But please see what happens when I put the stuff into a script (in the same terminal session):

    #!/usr/bin/env perl
    use strict;
    use warnings;

    binmode STDOUT, ":utf8";

    my $string = qq(Queensrÿche);
    print qq($string\n);

    my $y_with_trema = qq(\N{LATIN SMALL LETTER Y WITH DIAERESIS});
    print qq($y_with_trema\n);

    $string = qq(Queensr) . $y_with_trema . qq(che);
    print qq($string\n);
    __END__

    karls-mac-mini:monks karl$ ./roadster001.pl
    Queensrÿche
    ÿ
    Queensrÿche

    Seems like things are getting weird. I wonder when I will ever understand this crap.

    N.B.: I came in a bit late and didn't read all the posts yet.

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      I can actually answer this now :) Take a look below, hopefully that will clear it up.
      #!/usr/bin/env perl
      use strict;
      use warnings;
      use Encode;

      binmode STDOUT, ":utf8";

      my $string = qq(Queensrÿche);
      print qq($string\n);
      Encode::is_utf8($string) ? print " - is utf8\n" : print " - is not utf8\n";

      use utf8;
      $string = qq(Queensrÿche);
      no utf8;
      print qq($string\n);
      Encode::is_utf8($string) ? print " - is utf8\n" : print " - is not utf8\n";

      Output:

      Queensrÿche
       - is not utf8
      Queensrÿche
       - is utf8

        "...hopefully that will clear it up."

        Thank you Rodster001 for your reply.

        Unfortunately this doesn't explain why the One-Liner prints correctly.

        But perhaps I have yet another mental block ;-)

        Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»
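One possible explanation for the one-liner, sketched under assumptions: the one-liner has neither `use utf8` nor an output layer, so the UTF-8 bytes typed into the terminal pass through to STDOUT untouched. The script has `binmode STDOUT, ":utf8"` but no `use utf8`, so each source byte is treated as a separate one-byte character and encoded again on output. The helper name `emit` and the in-memory filehandles below are illustrative only and are not from the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# The two UTF-8 bytes of "ÿ" as Perl sees them in a UTF-8 source file
# when `use utf8` is NOT in effect: two separate one-byte characters.
my $bytes = "\xC3\xBF";

# Hypothetical helper: print $text through the given layer into an
# in-memory file, then return the resulting bytes as a hex string.
sub emit {
    my ($text, $layer) = @_;
    my $out = '';
    open my $fh, ">$layer", \$out or die "open failed: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $out;
}

# Like the one-liner: no output layer, bytes pass through unchanged.
print "no layer:    ", emit($bytes, ''), "\n";        # c3 bf

# Like the script: a :utf8 layer encodes each byte as its own character.
print ":utf8 layer: ", emit($bytes, ':utf8'), "\n";   # c3 83 c2 bf
```

If this is what happens, a UTF-8 terminal would show "ÿ" for the first case and "Ã¿" for the second, which is the double-encoding effect discussed elsewhere in this thread.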

Re^3: The Queensrÿche Situation
by Rodster001 (Pilgrim) on Oct 19, 2014 at 20:00 UTC
    I figured it out, sort of. The first is actually "extended ASCII", i.e. Latin-1 (255 maps to "ÿ"): http://www.ascii-code.com

    So, when I take the string "Queensrÿche" (which IS actually encoded as utf-8) for example:

    Decimal Char    escaped
    81      Q       Q
    117     u       u
    101     e       e
    101     e       e
    110     n       n
    115     s       s
    114     r       r
    195     -       \x{c3}
    191     -       \x{bf}
    99      c       c
    104     h       h
    101     e       e

    It is now printing on my terminal like this:
    QueensrÃ¿che
    This makes sense, in a way, now because 195 maps to "Ã" and 191 maps to "¿". So, now my question is, why isn't this mapping using a utf-8 table (instead of ascii)? Encode thinks the string is utf-8 (which I assume means the utf-8 flag is on).
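A minimal sketch of the distinction that seems to be at play here, assuming the string holds raw UTF-8 octets that were never decoded: the internal utf8 flag only describes how Perl stores the string, not whether the octets were interpreted as UTF-8. Turning the flag on with `utf8::upgrade` leaves characters 195 and 191 in place, while `Encode::decode` maps them to the single character 255. The variable names and the `describe` helper are illustrative, not from the original post.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# "Queensrÿche" as raw UTF-8 octets: ÿ is the two bytes 0xC3 0xBF.
my $bytes = "Queensr\xC3\xBFche";

# Same characters, but with the internal utf8 flag switched on.
# The string still contains characters 195 and 191.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Actually interpret the octets as UTF-8: 195 191 becomes one character, 255.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: summarise length, flag state and the character at index 7.
sub describe {
    my ($s) = @_;
    return sprintf "length %d, flag %s, char 7 = %d",
        length($s), (utf8::is_utf8($s) ? 'on' : 'off'), ord(substr($s, 7, 1));
}

print "bytes:    ", describe($bytes),    "\n";  # length 12, flag off, char 7 = 195
print "upgraded: ", describe($upgraded), "\n";  # length 12, flag on, char 7 = 195
print "decoded:  ", describe($chars),    "\n";  # length 11, flag on, char 7 = 255
```

Printing `$upgraded` through a `:utf8` layer would give "Ã¿", whereas printing `$chars` would give "ÿ", even though `is_utf8` reports true for both.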
