strange utf-8 (I think) behaviour...

seekay has asked for the wisdom of the Perl Monks concerning the following question:

Hello there!

Whilst not being new to developing I am very new to Perl. I have "inherited" a project, and needed to extend it to allow for utf-8 data. I have pretty much completed this task, but there is one last remaining problem.

The software sends email to a list of users whose names can contain accented or other European or Asian characters. The user information is read from a MySQL database.

The administrators of the software create an email template and use placeholders like %NAME%, which the code then replaces with the real name of the user. This is where my problem seems to be.

I use Encode to encode the address and subject for the email, and that works:

use Encode qw/encode/;
$to = "$name <$email>";
$to = encode('MIME-Q', $to);
$subject = encode('MIME-Q', $subject);
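
A minimal sketch of the header encoding described above, using only the core Encode module. The name and address values are made up for illustration, and encoding only the display name while leaving the `<address>` part as plain ASCII is an assumption about what the mail header should look like, not something taken from the original code:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample data; in the real program these come from the database.
my $name  = "b\x{e9}l\x{e9}";        # the characters b, e-acute, l, e-acute
my $email = 'bele@example.com';

# Encode only the display name as a MIME encoded-word;
# the address itself is already plain ASCII and is left alone.
my $to = encode('MIME-Q', $name) . " <$email>";

print "$to\n";
```

The input to encode('MIME-Q', ...) should be a decoded character string, so this only behaves as expected if $name has not already been turned into UTF-8 bytes.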

The body of the email is displayed correctly as utf-8. The name is substituted into the template via:

$tmpbody=~s/\%NAME\%/$name/sg;

The problem is that the name when displayed in the body of the email contains the familiar question marks or diamonds where the utf-8 multibyte characters should be.

Some additional background, after MANY trawls through the web: the db table's default charset is utf-8; I have checked the file that holds the email template and verified it is utf-8 (using file file_name at a Linux command prompt); and I have set $dbh->{'mysql_enable_utf8'} = 1; on the database connection. For good measure I also call decode_utf8($name) before using it in either the address or the string replace line.

Adding some additional debug info to the body of the email using:

$extra1 = DBI::data_string_desc($name);
$extra2 = DBI::data_string_desc($body);
$tmpbody=~s/\%NAME\%/$name."<br>".$extra1."<br>".$extra2/sg;

shows that both the name and body have the UTF8 flag on:
Dear b�l�." ".UTF8 on, non-ASCII, 4 characters 6 bytes." ".UTF8 on, non-ASCII, 606 characters 619 bytes
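
For comparison, here is a rough sketch of the same kind of report built from core Perl only. The describe() helper is hypothetical (it is not part of DBI or Encode) and only imitates the format of the output above. It compares a raw UTF-8 byte string with its decoded counterpart:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: reports the internal UTF8 flag, the character
# count and the internal byte count of a string.
sub describe {
    my ($s) = @_;
    my $chars = length $s;
    my $bytes = do { use bytes; length $s };
    return sprintf "UTF8 %s, %d characters %d bytes",
        (utf8::is_utf8($s) ? 'on' : 'off'), $chars, $bytes;
}

my $raw     = "b\xC3\xA9l\xC3\xA9";      # UTF-8 bytes as they might arrive from outside
my $decoded = decode('UTF-8', $raw);     # the same text as a character string

print describe($raw), "\n";       # UTF8 off, 6 characters 6 bytes
print describe($decoded), "\n";   # UTF8 on, 4 characters 6 bytes
```

Note that the flag only describes Perl's internal storage. A string that was decoded wrongly or decoded twice can also report "UTF8 on", so the character count is the more telling number.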

I have tried lots of different remedies on the above, from using $name = pack "U0C*", unpack "C*", $name; to using decode or encode in lots of different combinations. Once I saw the info above, though, I stopped and scratched my head, as it *looked* like everything should have worked!

Any help will be very much appreciated; I've spent a LOT of hours trying to figure this out!


Re: strange utf-8 (I think) behaviour... by moritz (Cardinal) on Oct 21, 2008 at 09:28 UTC
I think the best approach is to use an email handling module like MIME::Lite that does the encoding for you. Your problem is most likely that you are mixing decoded and encoded strings. Don't ever do that. The "right" approach is to decode everything that comes from the outside, build your string from the template, and then pass this decoded string to MIME::Lite. For more information on general Unicode handling in Perl see character encodings in perl, perluniintro, perlunicode, perlunifaq and Encode.
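To illustrate what mixing decoded and encoded strings does, here is a small sketch using only the core Encode module. The template text and the name are invented sample data, and the final encode() call merely stands in for whatever the mail module would do at the output boundary:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Data as it might arrive from outside: raw UTF-8 bytes.
my $name_bytes     = "b\xC3\xA9l\xC3\xA9";
my $template_bytes = "Dear %NAME%, gr\xC3\xBC\xC3\x9Fe";

# Decode everything once, at the input boundary.
my $name_chars     = decode('UTF-8', $name_bytes);
my $template_chars = decode('UTF-8', $template_bytes);

# Wrong: a byte string is substituted into a character string.
# Perl treats each byte as a Latin-1 character, so the name grows
# from 4 to 6 characters and is garbled when encoded again.
(my $mixed = $template_chars) =~ s/%NAME%/$name_bytes/g;

# Right: both sides are character strings.
(my $clean = $template_chars) =~ s/%NAME%/$name_chars/g;

# Encode once, at the output boundary.
my $mixed_out = encode('UTF-8', $mixed);
my $clean_out = encode('UTF-8', $clean);

printf "mixed: %d chars, clean: %d chars\n", length $mixed, length $clean;
```

Only $clean_out contains the original UTF-8 bytes for the name; in $mixed_out each accented character has been encoded twice.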
Re^2: strange utf-8 (I think) behaviour... by seekay (Initiate) on Oct 21, 2008 at 10:29 UTC
Thanks for your comment, Moritz! As a complete Perl newb I'd be a bit wary of building a new subroutine; so far I've only really tweaked what's there. Now, I remember reading somewhere that decoding in Perl was actually the same as encoding in utf-8, as that's what Perl uses for its internals. In order to decode everything I'm reading from the template and the db, is it as simple as decoding each element individually before using it? Would I then need to re-encode it, or should setting the charset in the content type for the email be sufficient? Thanks again for your help...
Re^3: strange utf-8 (I think) behaviour... by moritz (Cardinal) on Oct 21, 2008 at 11:54 UTC
"Now, I remember reading somewhere that decoding in Perl was actually the same as encoding in utf-8"

Not true.

use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);
binmode STDOUT, ':encoding(UTF-8)';
my $s = "sämple\n";
print uc $s;
print uc encode_utf8($s);
__END__
Output:
SÄMPLE
SÃ¤MPLE

Note that the `use utf8;` implicitly decodes string constants in this program.

"is it as simple as decoding each element individually before using it?"

Yes.

"Would I then need to re-encode it or should setting the charset in the content type for the email be sufficient?"

You should give your decoded string to a module that does the encoding for you. No need to re-invent the wheel.
Re^4: strange utf-8 (I think) behaviour... by seekay (Initiate) on Oct 24, 2008 at 02:11 UTC