Re^6: Malformed UTF-8

This is the output for both $term and $token just before the error:

TOKEN:
SV = PVMG(0x1c6cca0) at 0x1ab6a28
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x2dd2f80 "ba\303\261o"\0 [UTF8 "ba\x{f1}o"]
  CUR = 5
  LEN = 15
  MAGIC = 0x2dc95f0
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 4
--------------------
TERM:
SV = PVIV(0x18b8e20) at 0x1ab69c8
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  IV = 0
  PV = 0x2dd3110 "ba\303\261o"\0
  CUR = 5
  LEN = 12
--------------------

It appears $term is not actually UTF-8 encoded when this occurs. Additionally, is it me or does

[UTF8 "ba\x{f1}o"]

look wrong? From what I recall, the UTF8 part of Dump should show the actual word, meaning baño (with the accented n), and not the encoding.

Re^7: Malformed UTF-8 by Joost (Canon) on May 15, 2007 at 17:41 UTC
"It appears $term is not actually UTF-8 encoded when this occurs."

No, it IS UTF-8 encoded; perl just doesn't know that it is. And that can cause all kinds of crap. If you're reading $term from a handle (or reading any string from an encoded handle), you should set the handle's encoding using binmode (i.e. `binmode HANDLE, ":utf8";`) before reading from it. Or you can specify the :utf8 layer when you open() the file.

About the `[UTF8 "ba\x{f1}o"]`: note that \x{f1} does NOT specify an encoding. It's the literal notation for the character at code point 241 (U+00F1) of the Unicode set, which is also character 241 of the Latin-1 set, i.e. "ñ" eq "\x{f1}". The notation has the advantage that it's written in 7-bit ASCII, so it will print correctly (almost) everywhere, no matter whether your output expects UTF-8, Latin-1, Latin-9 (ISO-8859-15), etc.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
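A minimal sketch of the binmode/open advice above, assuming the data arrives as UTF-8 bytes. An in-memory filehandle stands in for the real file, all variable names are made up for illustration, and the stricter `:encoding(UTF-8)` layer is used here in place of `:utf8` (either one tells perl the handle is encoded):

```perl
use strict;
use warnings;

# The raw UTF-8 bytes for "baño" plus a newline (a stand-in for file contents).
my $bytes = "ba\303\261o\n";

# Read once without a layer: perl sees five separate bytes.
open(my $raw, '<', \$bytes) or die "open: $!";
my $undecoded = <$raw>;
close $raw;
chomp $undecoded;

# Read again with an encoding layer: perl decodes the bytes into four characters.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open: $!";
my $decoded = <$in>;
close $in;
chomp $decoded;

printf "undecoded: length %d, utf8 flag %s\n",
    length($undecoded), utf8::is_utf8($undecoded) ? 'on' : 'off';
printf "decoded:   length %d, utf8 flag %s\n",
    length($decoded), utf8::is_utf8($decoded) ? 'on' : 'off';

# The decoded string is the same four characters that "ba\x{f1}o" denotes.
print "decoded string equals ba\\x{f1}o\n" if $decoded eq "ba\x{f1}o";
```

The string read without a layer has the same shape as $term in the Dump output (CUR = 5, no UTF8 flag).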
Re^8: Malformed UTF-8 by spiros (Beadle) on May 15, 2007 at 17:51 UTC
Thank you very much. This might indeed be the root of the problem. I will have a closer look.
Re^7: Malformed UTF-8 by Joost (Canon) on May 15, 2007 at 17:48 UTC
Anyway, I updated my test program to something that should now replicate your error:

    use strict;
    use warnings;
    use Devel::Peek 'Dump';

    my $token = "ba\x{f1}o";
    utf8::upgrade($token);        # force utf-8 encoding and flag.
    my $term = "ba\303\261o";     # "utf-8 encoded" but no flag.

    warn "Token:\n";
    Dump($token);
    warn "Term:\n";
    Dump($term);

    print "match\n" if $token =~ /^$term/i;

Note that this runs fine (i.e. no match, no error) on my system (5.8.8 built for i686-linux-thread-multi).
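For comparison, here is a sketch of one possible fix on the consuming side, assuming $term really does hold UTF-8 bytes: decode it with the core Encode module before interpolating it into the pattern. The names `$term_bytes` and `$matched` are made up for illustration, and the `\Q...\E` quoting is an extra precaution that the test program does not use.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $token = "ba\x{f1}o";
utf8::upgrade($token);              # character string, UTF8 flag on

my $term_bytes = "ba\303\261o";     # UTF-8 encoded bytes, no flag

# Decode the bytes into characters so both strings mean the same thing to perl.
my $term = decode('UTF-8', $term_bytes);

# \Q...\E quotes any regex metacharacters the term might contain.
my $matched = $token =~ /^\Q$term\E/i ? 1 : 0;
print $matched ? "match\n" : "no match\n";
```

Once both strings are character strings, the pattern compares "ñ" with "ñ" instead of comparing "ñ" with the two bytes \303 \261.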
Re^8: Malformed UTF-8 by spiros (Beadle) on May 15, 2007 at 17:55 UTC
bunny:/tmp spiros$ perl test.pl
Token:
SV = PV(0x1801460) at 0x180bcf0
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x30afc0 "ba\303\261o"\0 [UTF8 "ba\x{f1}o"]
  CUR = 5
  LEN = 6
Term:
SV = PV(0x1801484) at 0x180bce4
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x300fa0 "ba\303\261o"\0
  CUR = 5
  LEN = 6
Malformed UTF-8 character (unexpected non-continuation byte 0x00, immediately after start byte 0xc3) in pattern match (m//) at test.pl line 17.

Hurray! Thanks once more!
Re^9: Malformed UTF-8 by Joost (Canon) on May 15, 2007 at 18:12 UTC
Note that this does mean your perl's Unicode support is broken on this issue. It should not give that warning; it should just silently not match. You can probably get your program to work all right if you're careful, but I would still recommend you upgrade your perl to v5.8.8.
Re^10: Malformed UTF-8 by spiros (Beadle) on May 15, 2007 at 19:58 UTC