Re^5: Malformed UTF-8

Re^6: Malformed UTF-8 by spiros (Beadle) on May 15, 2007 at 17:33 UTC
This is the output for both $term and $token just before the error:
`TOKEN:
SV = PVMG(0x1c6cca0) at 0x1ab6a28
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x2dd2f80 "ba\303\261o"\0 [UTF8 "ba\x{f1}o"]
  CUR = 5
  LEN = 15
  MAGIC = 0x2dc95f0
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 4
--------------------
TERM:
SV = PVIV(0x18b8e20) at 0x1ab69c8
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  IV = 0
  PV = 0x2dd3110 "ba\303\261o"\0
  CUR = 5
  LEN = 12
--------------------`
It appears $term is not actually UTF-8 encoded when this occurs. Additionally, is it just me, or does `[UTF8 "ba\x{f1}o"]` look wrong? From what I recall, the UTF8 part of Dump should show the actual word, meaning baño, and not the encoding.
Re^7: Malformed UTF-8 by Joost (Canon) on May 15, 2007 at 17:41 UTC
"It appears $term is not actually UTF-8 encoded when this occurs."

No, it IS utf-8 encoded, perl just doesn't know that it is. And that can cause all kinds of trouble. If you're reading $term from a handle (or reading any string from an encoded handle), you should set the handle's encoding using binmode (i.e. `binmode HANDLE,":utf8";`) before reading from it. Or you can specify the :utf8 layer when you open() the file.

About the `[UTF8 "ba\x{f1}o"]`: note that \x{f1} does NOT specify an encoding. It's the literal notation for the character with code point 241 (U+00F1) in Unicode, which is also code point 241 in latin-1, i.e. "ñ" eq "\x{f1}". The advantage of this notation is that it's 7-bit ASCII, so it will print correctly (almost) everywhere, no matter whether your output expects utf-8, latin-1, ISO-8859-15 etc.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
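To make the difference between "UTF-8 bytes" and "a string perl knows is UTF-8" concrete, here is a minimal sketch. It assumes the data really is valid UTF-8. The in-memory filehandle is only a stand-in for whatever handle $term is actually read from, and `Encode::decode` is shown as an alternative for a string that has already been read; neither comes from the original code.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "ba\303\261o";   # five raw octets, UTF8 flag off (like $term)

# Read the same octets through a handle with the :utf8 layer set.
# The in-memory handle is an assumption standing in for a real file.
open my $fh, '<', \$bytes or die "cannot open in-memory handle: $!";
binmode $fh, ':utf8';
my $from_handle = <$fh>;
close $fh;

# Alternatively, decode a string that was already read as raw bytes.
my $decoded = decode('UTF-8', $bytes);

print "lengths: ",
    join(", ", length($bytes), length($from_handle), length($decoded)), "\n";
print "both results equal \"ba\\x{f1}o\"\n"
    if $from_handle eq "ba\x{f1}o" and $decoded eq "ba\x{f1}o";
```

The raw string has length 5 (bytes), while both decoded strings have length 4 (characters), which matches the CUR = 5 versus MG_LEN = 4 difference in the Dump output above.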
Re^8: Malformed UTF-8 by spiros (Beadle) on May 15, 2007 at 17:51 UTC
Thank you very much. This might indeed be the root of the problem. I will have a closer look.
Re^7: Malformed UTF-8 by Joost (Canon) on May 15, 2007 at 17:48 UTC
Anyway, I updated my test program to something that should now replicate your error:
`use strict;
use warnings;
use Devel::Peek 'Dump';
my $token = "ba\x{f1}o";
utf8::upgrade($token);     # force utf-8 encoding and flag.
my $term = "ba\303\261o";  # "utf-8 encoded" but no flag.
warn "Token:\n";
Dump($token);
warn "Term:\n";
Dump($term);
print "match\n" if $token =~ /^$term/i;`
Note that this runs fine (i.e. no match, no error) on my system (5.8.8 built for i686-linux-thread-multi).
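For comparison, here is a hedged sketch of the same test with $term decoded before the match. It assumes the bytes in $term are valid UTF-8; the call to `Encode::decode` and the `$matched` variable are additions for illustration and are not part of the test program above.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $token = "ba\x{f1}o";
utf8::upgrade($token);                       # character string, UTF8 flag on

my $term = decode('UTF-8', "ba\303\261o");   # decode the raw octets first

# Both sides are now character strings holding the same four characters.
my $matched = ($token =~ /^$term/i) ? 1 : 0;
print $matched ? "match\n" : "no match\n";
```

With both strings holding characters rather than one holding characters and the other raw bytes, the pattern compares like with like and this prints "match".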
Re^8: Malformed UTF-8 by spiros (Beadle) on May 15, 2007 at 17:55 UTC
`bunny:/tmp spiros$ perl test.pl
Token:
SV = PV(0x1801460) at 0x180bcf0
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x30afc0 "ba\303\261o"\0 [UTF8 "ba\x{f1}o"]
  CUR = 5
  LEN = 6
Term:
SV = PV(0x1801484) at 0x180bce4
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x300fa0 "ba\303\261o"\0
  CUR = 5
  LEN = 6
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc3) in pattern match (m//) at test.pl line 17.`
Hurray! Thanks once more!
Re^9: Malformed UTF-8 by Joost (Canon) on May 15, 2007 at 18:12 UTC
Re^10: Malformed UTF-8 by spiros (Beadle) on May 15, 2007 at 19:58 UTC