:utf8 I/O layer vs encoding(UTF8), segfault and speed

mje has asked for the wisdom of the Perl Monks concerning the following question:

The node "A UTF8 round trip with MySQL", and specifically the reply "Re: A UTF8 round trip with MySQL", arrived at a timely moment for me. With all the Conficker talk, someone here ran an nmap scan and it crashed my Perl server:

utf8 "\x80" does not map to Unicode at Queue.pm line 835, <GEN6234> line 1.
Malformed UTF-8 character (unexpected continuation byte 0x80, with no preceding start byte) in pattern match (m//) at Queue.pm line 836, <GEN6234> line 1.
utf8 "\xD7" does not map to Unicode at Queue.pm line 835, <GEN6238> line 1.
utf8 "\xA4" does not map to Unicode at Queue.pm line 835, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
Malformed UTF-8 character (overflow at 0xcd0b2000, byte 0x00, after start byte 0xff) in subroutine entry at /usr/lib/perl5/5.8.8/i386-linux-thread-multi/Data/Dumper.pm line 179, <GEN6239> line 1.
utf8 "\x80" does not map to Unicode at Queue.pm line 835, <GEN6242> line 1.
utf8 "\xE0" does not map to Unicode at Queue.pm line 835, <GEN6245> line 1.
Segmentation fault

The code in question accepts UTF-8 encoded data (or at least data that is supposed to be UTF-8 encoded) from a socket that has the :utf8 I/O layer set. The code generating the warnings is reading from that socket.

There are a few things about this, and about the nodes mentioned above, that I don't understand:

1. What exactly is the difference between :utf8 and :encoding(UTF8)? Juerd seems to be suggesting that :utf8 should not be used and that it simply sets the internal utf8 flag, yet I am getting warnings that suggest slightly more than setting a flag is going on.
2. Why is this segfaulting?
3. Why, when I change to :encoding(UTF8), does it stop segfaulting but slow down a lot?

As a quick test I got hold of a jpg file (obviously not UTF-8 encoded) and did:

use strict;
use warnings;

my $fh;
open($fh, "<:utf8", "schema.jpg") or die "Cannot open schema.jpg: $!";
my $img = '';
while (<$fh>) {
    $img .= $_;
}

This takes 0.123s to run and outputs a lot of warnings. Changing it to use :encoding(UTF8) takes 27s and outputs hundreds of warnings.


Replies are listed 'Best First'.
Re: :utf8 I/O layer vs encoding(UTF8), segfault and speed by ikegami (Patriarch) on Apr 01, 2009 at 19:52 UTC
`:utf8` doesn't fix bad data.

$ perl -MDevel::Peek -we'my $buf = "\x80"; open my $fh, "<:utf8", \$buf or die; my $x = <$fh>; Dump $x;'
utf8 "\x80" does not map to Unicode at -e line 1, <$fh> line 1.
SV = PV(0x814fbb4) at 0x814f6cc
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x8170a78 "\200"\0Malformed UTF-8 character (unexpected continuation byte 0x80, with no preceding start byte) in subroutine entry at -e line 1, <$fh> line 1.
 [UTF8 "\x{0}"]
  CUR = 1
  LEN = 80

`:encoding(UTF-8)` replaces the bad data (with the 4 chars '`\x80`' in this case).

$ perl -MDevel::Peek -we'my $buf = "\x80"; open my $fh, "<:encoding(UTF-8)", \$buf or die; my $x = <$fh>; Dump $x;'
utf8 "\x80" does not map to Unicode at -e line 1.
SV = PV(0x814fbb4) at 0x814f6cc
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x8197048 "\\x80"\0 [UTF8 "\\x80"]
  CUR = 4
  LEN = 80

I can't answer your other questions.
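
The substitution described in this reply can also be checked without Devel::Peek. The following is only a minimal sketch, not part of the original reply: it reads the same one-byte buffer through `:encoding(UTF-8)` and prints what came back. It assumes the layer's default behaviour (warn and substitute) is in effect, as in the Dump output quoted in this reply.

```perl
use strict;
use warnings;

# A lone continuation byte: not well-formed UTF-8.
my $buf = "\x80";

# Read it through the validating layer via an in-memory filehandle.
open my $fh, '<:encoding(UTF-8)', \$buf
    or die "Cannot open in-memory handle: $!";
my $x = <$fh>;
close $fh;

# The bad byte is expected to have been replaced by the literal
# characters backslash, 'x', '8', '0' (a warning is emitted as well).
printf "length=%d text=%s\n", length($x), $x;
```

The string that reaches the program is well-formed, so later regex matches or Data::Dumper calls are not handed a malformed buffer.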
Re: :utf8 I/O layer vs encoding(UTF8), segfault and speed by almut (Canon) on Apr 01, 2009 at 18:49 UTC
"1. what exactly is the difference between :utf8 and :encoding(UTF8)"

As Juerd says:

":utf8 sets the SvUTF8 flag on input without validating it. :encoding(utf8) properly decodes input, ensuring that it is safe."

What exactly isn't clear about that? Properly decoding and validating of course takes time. OTOH, if you let perl work with unvalidated 'UTF-8' strings, nasty things can happen (including segfaults), because Perl's unicode internals have not been implemented to handle this safely in each and every case... Strings which are not properly encoded in UTF-8 should not have the utf8 flag on.
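
For input from an untrusted socket, one way to follow this advice is to read raw bytes and decode them explicitly, rejecting anything malformed instead of letting it into the program with the utf8 flag set. This is a minimal sketch under stated assumptions, not code from the thread: `decode_strict` is a hypothetical helper name, and an in-memory filehandle stands in for the socket.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): returns the decoded string,
# or undef if $octets is not well-formed UTF-8.
sub decode_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input;
    # LEAVE_SRC stops decode() from modifying $octets in place.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

# Stand-in for the socket: one valid line and one malformed line,
# read as raw bytes with no decoding layer.
my $input = "caf\xC3\xA9\n\x80\xD7\xA4\n";
open my $fh, '<:raw', \$input
    or die "Cannot open in-memory handle: $!";

my (@accepted, @rejected);
while (my $line = <$fh>) {
    chomp $line;
    my $text = decode_strict($line);
    if (defined $text) { push @accepted, $text }
    else               { push @rejected, $line }
}
close $fh;

printf "accepted=%d rejected=%d\n", scalar @accepted, scalar @rejected;
```

Whether to drop the connection, log the bytes, or substitute a replacement on failure is a design choice for the server; the point of the sketch is only that malformed input never gets the utf8 flag.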
Re^2: :utf8 I/O layer vs encoding(UTF8), segfault and speed by mje (Curate) on Apr 01, 2009 at 18:55 UTC
OK, I get that, but where do the warnings/errors come from? What I mean is: are the errors separate from the encoding problems?
Re^3: :utf8 I/O layer vs encoding(UTF8), segfault and speed by almut (Canon) on Apr 01, 2009 at 19:22 UTC
"What I mean is are the errors separate from encoding problems"

I only see two types of errors: `utf8 "\x.." does not map to Unicode` and `Malformed UTF-8 character (...details...)`, both of which indicate encoding problems due to malformed input.
Re^4: :utf8 I/O layer vs encoding(UTF8), segfault and speed by mje (Curate) on Apr 01, 2009 at 19:34 UTC
Re^5: :utf8 I/O layer vs encoding(UTF8), segfault and speed by almut (Canon) on Apr 01, 2009 at 19:44 UTC