
The node A UTF8 round trip with MySQL, and specifically Re: A UTF8 round trip with MySQL, seems to have arrived at a coincidental time for me. With all the Conficker talk, someone here ran an nmap scan and it crashed my Perl server:

utf8 "\x80" does not map to Unicode at Queue.pm line 835, <GEN6234> line 1.
Malformed UTF-8 character (unexpected continuation byte 0x80, with no preceding start byte) in pattern match (m//) at Queue.pm line 836, <GEN6234> line 1.
utf8 "\xD7" does not map to Unicode at Queue.pm line 835, <GEN6238> line 1.
utf8 "\xA4" does not map to Unicode at Queue.pm line 835, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
utf8 "\x80" does not map to Unicode at Queue.pm line 835, <GEN6242> line 1.
utf8 "\xE0" does not map to Unicode at Queue.pm line 835, <GEN6245> line 1.
Segmentation fault

The code in question accepts data from a socket that is supposed to be UTF-8 encoded, and it sets the :utf8 I/O layer on that socket. The warnings come from the code that reads from that socket.

There are a few things about this and the nodes quoted I don't understand.

What exactly is the difference between :utf8 and :encoding(UTF8)? Juerd seems to be suggesting that a) :utf8 should not be used and b) it simply sets the internal utf8 flag, yet the warnings I am getting suggest that slightly more than setting a flag is going on.
Why is this segfaulting?
Why, when I change to :encoding(UTF8), does it stop segfaulting but slow down a lot?
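
To make the first question concrete, here is a minimal sketch of a third option that is sometimes used for untrusted input: read raw bytes and decode them explicitly, so malformed data is rejected rather than passed along. The helper name decode_strict is hypothetical, and the choice of the strict 'UTF-8' encoding with FB_CROAK is an assumption for illustration, not what Queue.pm does.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: strictly decode one chunk of raw bytes.
# Returns the decoded string, or undef if the bytes are not valid UTF-8.
sub decode_strict {
    my ($bytes) = @_;    # a copy, so decode() may consume it freely
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK) };
    return $text;        # undef when decode() croaked
}

# "caf\xC3\xA9" is valid UTF-8; "\x80" is a lone continuation byte.
for my $bytes ("caf\xC3\xA9", "\x80") {
    my $text = decode_strict($bytes);
    print defined $text
        ? "ok, " . length($text) . " characters\n"
        : "rejected\n";
}
```

With a real socket this would presumably mean reading in raw mode (no :utf8 layer) and calling the helper on each complete message; how messages are framed is left out of the sketch.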

As a quick test I got hold of a JPEG file (obviously not UTF-8 encoded) and did:

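use strict;
use warnings;

# Deliberately read a non-UTF-8 file through the :utf8 layer
open(my $fh, "<:utf8", "schema.jpg") or die "Cannot open schema.jpg: $!";
my $img = '';
while (<$fh>) {
    $img .= $_;
}
close($fh);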

This takes 0.123s to run and outputs a lot of warnings. Changing it to use :encoding(UTF8) takes 27s and outputs hundreds of warnings.
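
For anyone who wants to reproduce the comparison without my file, the following is a rough sketch of how the two layers could be timed side by side. The helper time_layer and the generated temporary file (a stand-in for schema.jpg) are assumptions for illustration; the timings and warning counts will differ by Perl version and input, so none are shown.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Time::HiRes qw(time);

# Hypothetical helper: read $path line by line through $layer and return
# the elapsed seconds plus the number of warnings raised while reading.
sub time_layer {
    my ($layer, $path) = @_;
    my $warnings = 0;
    local $SIG{__WARN__} = sub { $warnings++ };    # count instead of printing
    my $start = time;
    open(my $fh, "<$layer", $path) or die "Cannot open $path: $!";
    my $data = '';
    while (my $line = <$fh>) {
        $data .= $line;
    }
    close($fh);
    return (time - $start, $warnings);
}

# Stand-in for schema.jpg: a small file of bytes that are not valid UTF-8.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} join '', map { chr(128 + $_ % 128) . "\n" } 1 .. 200;
close($tmp);

for my $layer (':utf8', ':encoding(UTF8)') {
    my ($secs, $warnings) = time_layer($layer, $path);
    printf "%-16s %.3fs, %d warnings\n", $layer, $secs, $warnings;
}
```

Counting warnings in a handler rather than printing them keeps terminal output from dominating the measurement.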

In reply to :utf8 I/O layer vs encoding(UTF8), segfault and speed by mje
