in reply to Re: Never touch (or look at) the UTF8 flag!!
in thread Interventionist Unicode Behaviors

# Only one difference between these: the UTF8 flag is on for $uni

As far as I understand it, this need not be true. You mention that you do a lot of XS programming. It seems to me that you are equating the XS concept of altering the PV with the Perl concept of assigning to a string. With byte semantics that is correct, but with utf8 semantics it is not. I don't believe there is any guarantee that this will always be true. For instance, Perl 5.12 could be entirely Unicode internally, and your program would break. Likewise, if the string were utf8-on before you did the assignment, the result would be different. If you want to operate at the level you seem to, I think you should use pack.
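
To make the pack suggestion concrete, here is a minimal sketch (not the code from the original post, which is not shown here) that builds a character string and a byte string explicitly, so that nothing depends on how a literal happens to be stored. The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "U" packs a Unicode code point and yields a character string.
my $chars = pack( 'U', 0x263A );

# "C*" packs raw octets and yields a byte string; these three octets
# are the UTF-8 encoding of U+263A.
my $bytes = pack( 'C*', 0xE2, 0x98, 0xBA );

printf "chars: length=%d is_utf8=%d\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0;
printf "bytes: length=%d is_utf8=%d\n",
    length($bytes), utf8::is_utf8($bytes) ? 1 : 0;

# Decoding a copy of the octets in place gives a string equal to the
# character version, without setting the flag by hand.
my $decoded = $bytes;
utf8::decode($decoded);
print "decoded octets equal the character string\n" if $decoded eq $chars;
```

The intent is stated in the template rather than inferred from the flag: one string is one character long, the other is three octets long, and the conversion between them is an explicit decode.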

Also a little nit: Perl doesn't do UTF-8, it does utf8, which is subtly different from true UTF-8. Although in the context here I don't think it matters.
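
One place the distinction is visible is the Encode module, which treats the names "utf8" and "UTF-8" as two different encodings: the first is Perl's lax internal form, the second the strict standard one. The sketch below is only an illustration under that assumption; it prints what each name resolves to and how each encodes a code point above U+10FFFF, which lies outside Unicode.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( find_encoding encode );

# Ask Encode which encoding object each name resolves to.
my $lax_name    = find_encoding('utf8')->name;
my $strict_name = find_encoding('UTF-8')->name;
print "utf8  resolves to: $lax_name\n";
print "UTF-8 resolves to: $strict_name\n";

# A code point beyond U+10FFFF; silence the related warnings locally.
my ( $lax, $strict );
{
    no warnings 'utf8';
    my $beyond = chr(0x110000);
    $lax    = encode( 'utf8',  $beyond );
    $strict = encode( 'UTF-8', $beyond );
}

# Show the resulting octets in hex so the two can be compared by eye.
printf "lax:    %s\n", join ' ', map { sprintf '%02X', ord } split //, $lax;
printf "strict: %s\n", join ' ', map { sprintf '%02X', ord } split //, $strict;
```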

---
$world=~s/war/peace/g

Re^3: Never touch (or look at) the UTF8 flag!!
by creamygoodness (Curate) on Sep 08, 2006 at 13:43 UTC

    Can Perl 5.12 start interpreting string literals with utf8 semantics without shattering backwards compatibility? I didn't think that was likely to happen. If it does it won't just be my examples here that break! :)

    I tried to manufacture an example to demo your assertion that if the string were utf8-on prior to assignment the result would differ, but I haven't been able to come up with anything. Certainly the results would differ when concatenating onto the end of an existing string, but wholesale assignment? How would you change the following example to illustrate your point?

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw( _utf8_on );
    use Devel::Peek qw( Dump );

    my $foo;
    $foo = 'foo';
    Dump($foo);

    my $bar;
    _utf8_on($bar);
    $bar = 'bar';
    Dump($bar);

    __END__
    Outputs:

    SV = PV(0x1801660) at 0x180b59c
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x365f20 "foo"\0
      CUR = 3
      LEN = 4
    SV = PV(0x1801678) at 0x180b56c
      REFCNT = 1
      FLAGS = (PADBUSY,PADMY,POK,pPOK)
      PV = 0x300bd0 "bar"\0
      CUR = 3
      LEN = 4

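
    For contrast, here is a minimal sketch of the concatenation case mentioned above, next to a wholesale assignment over a scalar that previously held a flagged string. It uses utf8::is_utf8 and bytes::length instead of Devel::Peek so the output is short. What the flag reports after the assignment is an implementation detail of whichever perl runs it, so treat that line as an observation, not a guarantee.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use bytes ();    # load bytes::length without enabling byte semantics

    my $chars = "\x{263A}";    # character string, UTF8 flag on
    my $byte  = "\xE9";        # one-octet string, UTF8 flag off

    # Concatenation: the one-octet string is upgraded first, so U+00E9
    # occupies two octets in the result's internal buffer.
    my $cat = $chars . $byte;
    printf "concat: chars=%d octets=%d is_utf8=%d\n",
        length($cat), bytes::length($cat), utf8::is_utf8($cat) ? 1 : 0;

    # Wholesale assignment over a scalar that held a flagged string.
    my $bar = "\x{263A}";
    $bar = 'bar';
    printf "assign: chars=%d octets=%d is_utf8=%d\n",
        length($bar), bytes::length($bar), utf8::is_utf8($bar) ? 1 : 0;
    ```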
    I also have to confess that I don't understand what you mean when you say that altering the PV is not the same as assigning to a string under utf8 semantics. Certainly you'd need to alter the PV (and other values, possibly flags too) following the rules of Unicode handling: for instance, checking for illegal sequences at splice points. Is that what you mean? I've spelunked a lot of the utf8 code in sv.c and I don't recall having seen any constructs I didn't recognize.

    And thanks for the suggestion on pack. Maybe if I'd used that in the OP there'd be less blood on the floor! :)

    --
    Marvin Humphrey
    Rectangular Research ― http://www.rectangular.com