
I am somehow embarrassed that I didn't look into the source code myself. My excuse is that I was convinced that somebody out there already had a list showing which value SUBCHAR is under which circumstances.

In a perfect world, you wouldn't have to look into the source; everything would be explained in the documentation.

Regarding using a coderef for CHECK, I think I had got that in the meantime, but IMHO the question remains if a malformed character has an ordinal value at all (quite sure yes, but which?).

I would assume that, when converting bytes pretending to be UTF-8 to Perl's internal representation, the CHECK coderef is called with the value of the (first) offending byte. For the other direction (Perl to a UTF-8 bytestream), I would expect to get the (first) offending Perl character that can't be expressed as a UTF-8 bytestream.

Let's test that:

    #!/usr/bin/perl

    use v5.10;
    use strict;
    use warnings;
    use Encode 2.12 qw();

    my $octets="a\xFEb";
    #            ^-- byte 0xFE is invalid in UTF-8, see https://en.wikipedia.org/wiki/Utf-8
    my $string=Encode::decode(
        'utf-8',
        $octets,
        sub {
            my $value=shift;
            return sprintf('<0x%04X>',$value);
        }
    );
    say $string;

    $string="a\x{123456}b";
    #         ^-- Unicode is defined from 0 to 0x10FFFF
    $octets=Encode::encode(
        'utf-8',
        $string,
        sub {
            my $value=shift;
            return sprintf('<0x%08X>',$value);
        }
    );
    say $octets;

    $string="a\x{00C4}b\x{263A}c"; # a A-Umlaut b Smile c
    #                  ^-- not available in ISO-8859-1
    $octets=Encode::encode(
        'utf-8',
        $string,
        sub {
            die "Should not happen";
        }
    );
    # from_to() converts bytes, not characters. To make things easier,
    # I use a destination encoding where 1 byte = 1 character.
    Encode::from_to(
        $octets, # in-place
        'utf-8',
        'iso-8859-1',
        sub {
            my $value=shift;
            return sprintf('<0x%04X>',$value);
        }
    );
    binmode STDOUT,':encoding(utf-8)'; # I use a UTF-8 terminal
    say $octets; # implicit conversion from ISO-8859-1 to UTF-8 due to binmode above

Output:

    a<0x00FE>b
    a<0x00123456>b
    aÄb<0x263A>c
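
For comparison, here is a minimal sketch of what happens without a coderef, using only the documented CHECK constants FB_DEFAULT and FB_PERLQQ. The variable names are my own, and LEAVE_SRC is added only so that the input variables are not modified. With FB_DEFAULT, Encode documents that decoding substitutes U+FFFD and encoding substitutes the target encoding's SUBCHAR; for ISO-8859-1 I would expect that to be "?" (0x3F).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw();

my $bytes = "a\xFEb";      # 0xFE is never valid in UTF-8
my $chars = "a\x{263A}b";  # U+263A has no ISO-8859-1 equivalent

# CHECK = FB_DEFAULT: decoding replaces the malformed byte with U+FFFD
my $decoded = Encode::decode('utf-8', $bytes, Encode::FB_DEFAULT());
printf "decode, FB_DEFAULT: U+%04X\n", ord(substr($decoded, 1, 1));

# CHECK = FB_DEFAULT: encoding replaces the unmappable character
# with the SUBCHAR of the target encoding
my $encoded = Encode::encode('iso-8859-1', $chars, Encode::FB_DEFAULT());
printf "encode, FB_DEFAULT: 0x%02X\n", ord(substr($encoded, 1, 1));

# CHECK = FB_PERLQQ: the offending value is inserted as a Perl-style escape.
# LEAVE_SRC keeps $bytes and $chars untouched.
my $check  = Encode::FB_PERLQQ() | Encode::LEAVE_SRC();
my $qq_dec = Encode::decode('utf-8', $bytes, $check);
my $qq_enc = Encode::encode('iso-8859-1', $chars, $check);
print "decode, FB_PERLQQ: $qq_dec\n";
print "encode, FB_PERLQQ: $qq_enc\n";
```

The FB_PERLQQ lines show the same values the coderef received above (the byte 0xFE when decoding, the code point 0x263A when encoding), just in a fixed escape format instead of one you choose yourself.
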
I have asked question 2) mainly because I felt that not explaining SUBCHAR is a substantial lack of documentation and was hoping that I had overlooked something.

Yes, the documentation could be improved. If you can spend five minutes, file a bug. If you can spend an hour or two, create a patch for the POD and submit it. You now know quite well what's missing in the POD, and how it should be explained. Perhaps post a preview for discussion here.

(If you are working for a boss (and not for fun), explain to him/her that this little bit of time is a kind of "usage fee" for the huge amount of well-written code you use from the Perl community. That was my argument for publishing the initial Unicode patch for DBD::ODBC, and my boss was quite happy with that. We had the patch, we needed it, and by publishing it, the Unicode support became even better. And best of all: I don't have to support it any more. mje has merged it into DBD::ODBC and improved it since then.)

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^6: Question about Encode module and CHECK parameter
by Nocturnus (Scribe) on Aug 16, 2015 at 07:35 UTC

    Thank you very much again for the sample code. I am convinced that it will be very useful to many other people, too.

    I am not working just for fun, but I am my own boss, so I guess it will not be too difficult to convince the boss that making patches is worth the time and effort (I have occasionally done such things in the past for other projects, too). I'll look into the POD format and the patch submission process ...

    Thanks again,

    Nocturnus