PerlMonks  

Re^6: Seeking Perl docs about how UTF8 flag propagates

by choroba (Cardinal)
on May 17, 2023 at 08:07 UTC (id://11152244)


in reply to Re^5: Seeking Perl docs about how UTF8 flag propagates
in thread Seeking Perl docs about how UTF8 flag propagates

> However substr() appears to propagate it only if the resulting substring has characters above 0x7f

What do you mean by "appears"?

I tried the following:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Devel::Peek;

    my $s = "\N{LATIN SMALL LETTER S WITH CARON}i\N{LATIN SMALL LETTER C WITH CARON}";
    for my $i (0 .. 2) {
        my $c = substr $s, $i, 1;
        Dump($c);
    }

Running it through 2>&1 | grep FLAGS outputs

    FLAGS = (POK,pPOK,UTF8)
    FLAGS = (POK,pPOK,UTF8)
    FLAGS = (POK,pPOK,UTF8)

Update: Fixed the encoding of the code.

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

Replies are listed 'Best First'.
Re^7: Seeking Perl docs about how UTF8 flag propagates
by hv (Prior) on May 17, 2023 at 12:13 UTC

    I looked at a small number of variations around the second case here:

        $x = "foo\x{100}";
        chop($y = $x);
        $z = substr($x, 0, 1); Dump($z);  # UTF8
        $a = substr($y, 0, 1); Dump($a);  # not UTF8

    Looking at the perl source, substr appears to treat the string differently (and ends up not flagging the result as UTF8) if the byte length and the character length of the whole source string are the same. (That check is a potential efficiency concern: finding the character length of a large UTF8-flagged string is expensive.)
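
The comparison hv describes can be made visible without Devel::Peek. The following is a minimal sketch, not code from the thread: flag_report is a hypothetical helper that prints the character length, the internal byte length, and the UTF8 flag of both the source string and its first-character substring. The byte length is obtained by UTF-8 encoding a copy, which assumes that a flagged string is stored internally as its UTF-8 encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collect the lengths and
# UTF8 flags that hv's explanation is about.
sub flag_report {
    my ($str) = @_;
    my $chars = length $str;

    # For a UTF8-flagged string the internal byte length equals the
    # length of its UTF-8 encoding. Encode a copy so $str is untouched;
    # unflagged strings are already one byte per character.
    my $copy = $str;
    utf8::encode($copy) if utf8::is_utf8($str);
    my $bytes = length $copy;

    my $first = substr $str, 0, 1;
    return {
        chars          => $chars,
        bytes          => $bytes,
        source_flagged => utf8::is_utf8($str)   ? 1 : 0,
        substr_flagged => utf8::is_utf8($first) ? 1 : 0,
    };
}

my $x = "foo\x{100}";   # contains a character above 0xFF
my $y = $x;
chop $y;                # "foo", derived from the flagged string

for my $case ([ '$x', $x ], [ '$y', $y ]) {
    my ($name, $str) = @$case;
    my $r = flag_report($str);
    printf "%s: chars=%d bytes=%d source_flagged=%d substr_flagged=%d\n",
        $name, @{$r}{qw(chars bytes source_flagged substr_flagged)};
}
```

If hv's reading of the source holds on your perl, the substring should lose the flag exactly in the case where chars and bytes are equal, which is why choroba's string (which still contains š and č) keeps the flag even for the plain "i".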

Re^7: Seeking Perl docs about how UTF8 flag propagates
by LanX (Saint) on May 17, 2023 at 11:21 UTC
    To get rid of the dependency on the OS grep, try utf8::is_utf8:

        use v5.12.0;
        use warnings;
        #use utf8;

        binmode STDOUT, ":utf8";

        print my $s = "\N{LATIN SMALL LETTER S WITH CARON}i\N{LATIN SMALL LETTER C WITH CARON}";
        say " is text" if utf8::is_utf8($s);

        for my $i (0 .. 2) {
            print my $c = substr $s, $i, 1;
            say " is text" if utf8::is_utf8($c);
        }

    šič is text
    š is text
    i is text
    č is text
    
    (in pre-tags b/c of PM restrictions)

    Please note that even after commenting out use utf8, $s is still automatically flagged as text.

    The pragma is optional here (not recommended), because

    • the utf8:: functions are always available in Perl, with or without the pragma
    • there is no wide character in the source code

    Cheers Rolf
    (addicted to the 𐍀𐌴𐍂𐌻 Programming Language :)
    Wikisyntax for the Monastery
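
LanX's remark that $s is flagged even without use utf8 can be checked in isolation. The sketch below is an illustration rather than code from the thread: the source file is pure ASCII, the pragma is deliberately not loaded, and the %literal table and its labels are made up for the example. It contrasts escapes that name a code point above 0xFF with a single-byte escape and a plain ASCII literal.

```perl
#!/usr/bin/perl
use strict;
use warnings;
# Deliberately no "use utf8": the source text below is pure ASCII,
# and the utf8:: functions are available without the pragma.

# Illustrative table (labels are hypothetical, chosen for this sketch).
my %literal = (
    'N-escape U+161' => "\N{U+161}",   # s with caron, above 0xFF
    'x-escape 0x161' => "\x{161}",     # same character, hex escape
    'x-escape 0xE9'  => "\xE9",        # fits in one byte
    'plain ASCII'    => "i",
);

for my $label (sort keys %literal) {
    printf "%-16s %s\n", $label,
        utf8::is_utf8($literal{$label}) ? "flagged" : "not flagged";
}
```

A character above 0xFF cannot be stored as a single byte, so perl has to use its UTF-8 representation for the two wide escapes no matter how the source file is declared; the pragma only changes how the bytes of the source file itself are interpreted.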
