in reply to Re: good way to implement utf8::is_utf8 for perl 5.8.0
in thread good way to implement utf8::is_utf8 for perl 5.8.0

Thanks!
Why do you need to know?
I am maintaining an app that runs on Perl 5.8.0, and upgrading Perl would require too much testing (it's a very big app).

Re^3: good way to implement utf8::is_utf8 for perl 5.8.0
by ikegami (Patriarch) on Mar 19, 2009 at 17:49 UTC

    I meant: What use do you have for is_utf8? I find it's usually used where the following should be used:

    utf8::downgrade($s, 1) or croak("Wide characters in foo argument");
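
    A minimal sketch of that pattern in context, assuming a hypothetical sub that has to emit raw octets (the sub name and message are illustrative only):

        use strict;
        use warnings;
        use Carp qw(croak);

        sub write_octets {
            my ($fh, $s) = @_;
            # Force the internal representation to one byte per character.
            # The true second argument makes downgrade return false instead
            # of dying when the string holds characters above \xFF.
            utf8::downgrade($s, 1)
                or croak("Wide characters in write_octets argument");
            print {$fh} $s;    # now safe for byte-oriented output
        }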

      Well, since you asked...

      I'm using it to debug a problem. I suspect using utf8::downgrade is the solution, but I want to verify that.

      Specifically, I am using LWP to make https connections and LWP is calling Crypt::SSLeay. I'm trying to track down a problem which I think is being caused by the header having the UTF8 flag set (even though it contains only ASCII characters).

      The version of LWP I'm using is either very ancient or has been hacked -- it writes the header separately from the body. Newer versions of LWP concatenate the header and body before writing out the request. In my case this would immediately cause corruption when the body contained non-ASCII byte values.
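
      A minimal sketch of how that concatenation corrupts things, assuming an ASCII-only header that has somehow acquired the UTF8 flag (the header and body values here are made up):

          use strict;
          use warnings;
          use Devel::Peek qw(Dump);

          my $header = "POST / HTTP/1.0\r\n\r\n";
          utf8::upgrade($header);         # ASCII-only, but the UTF8 flag is now set

          my $body = "\xC9";              # a raw non-ASCII byte in the body

          my $request = $header . $body;  # concatenation upgrades the body too
          Dump($request);                 # PV now ends in \303\211 (C3,89), not \311 (C9)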

      However, I'm only seeing corruption some of the time. My suspicion is that the corruption is occurring somewhere in the bowels of Crypt::SSLeay -- something like:

      1. LWP syswrites the header (with the UTF8 flag set).
      2. Crypt::SSLeay writes out the encrypted header.
      3. LWP syswrites the body.
      4. Crypt::SSLeay writes out the encrypted body, but perhaps an internal buffer that it uses has the UTF8 flag set, which corrupts the body.

      The corruption is consistent with bytes with the high-bit set getting converted to their UTF8 encoding.

      I've noticed that the most recent version of LWP performs a downgrade of the header to ensure its internal representation is bytes. Guess there must have been a good reason for that change...

        I'm using it to debug a problem.

        I recommend Devel::Peek's Dump.
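
        For example (a sketch; $header stands in for whatever scalar ends up being handed to Crypt::SSLeay):

            use Devel::Peek qw(Dump);

            my $header = "Content-Type: text/plain\r\n";
            utf8::upgrade($header);    # simulate a header that picked up the UTF8 flag
            Dump($header);             # look for UTF8 in the FLAGS line and for
                                       # multi-byte escapes in the PV line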

        The corruption is consistent with bytes with the high-bit set getting converted to their UTF8 encoding.

        Way too many XS modules access a string's buffer with no regard to the setting of the UTF8 flag. Sounds like it's happening here.

        string     possible internal representations   default typemap to char*
        "\x{C9}"   C9;    UTF8=0                       C9
                   C3,89; UTF8=1                       C3,89
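
        You can watch the internal representation change while the string's value stays the same (a small sketch):

            use Devel::Peek qw(Dump);

            my $s = "\xC9";
            Dump($s);            # PV is the single byte \311 (C9), UTF8 flag off
            utf8::upgrade($s);
            Dump($s);            # PV is now \303\211 (C3,89), UTF8 flag on;
                                 # the string still holds exactly one character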

        I suspect using utf8::downgrade is the solution, but I want to verify that.

        It is indeed the solution. utf8::downgrade converts the internal encoding of a string from utf8 to bytes if it isn't already in that form. The XS module should do that for you, but you can do it for the module.

        string     possible internal representations   downgraded    default typemap to char*
        "\x{C9}"   C9;    UTF8=0                       C9; UTF8=0    C9
                   C3,89; UTF8=1                       C9; UTF8=0    C9
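
        A sketch of doing it for the module, assuming $header is the scalar that eventually reaches the XS code:

            # Convert the internal encoding to bytes before the XS layer sees it.
            # The true second argument makes downgrade return false instead of
            # dying if the string holds characters that don't fit in a byte.
            utf8::downgrade($header, 1)
                or die "Header contains wide characters\n";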