Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

In sending emails to our Russian members I have the email send as UTF-8 in the header... it works fine, however whitespace is converted into this weird character: “3 â€

It appears to be caused by whitespace in the Cyrillic language, so I was told I had to convert all whitespace to this:
&nbsp;
However, when I do this, it also converts the whitespace inside the HTML tags, which breaks them. So how can I swap the whitespace, which I do like this:
my $_swapWithText = '&nbsp;'; $__htmlEmailText =~ s/ /$_swapWithText/eg;
How can I do this ONLY for stuff NOT IN HTML tags? I know how to swap out only HTML, but cannot figure out how to swap whitespace in the whole variable except what is inside the HTML tags...

Your wisdom will be greatly appreciated.

Richard

Replies are listed 'Best First'.
Re: Problem with UTF-8 email
by graff (Chancellor) on Jul 27, 2010 at 00:58 UTC
    I'll agree with zwon -- you haven't given enough information. It would help if you could show a snippet of input data and resulting output, along with a minimal (but runnable) perl script that shows how you converted that input to that output.

    It'll also help if you present the data in the form of hex byte values. The thing that you say is the "weird character" that whitespace is converted to is not a "character" -- it's a string of up to 14 bytes, two of which are in the ASCII range ("3" and space, which happen to be adjacent in the string: ... 0x33 0x20 ...). I won't hazard a guess as to what the other bytes might represent, because I'm not even sure whether the string you posted is an accurate copy of the output you got. (That's why the hex byte-value dump is important.)

    So, your question can't be answered yet, because you haven't shown us what the input looked like or what sort of code is creating the output. And it's not even clear what the output really is.

    In case it helps, here's a perl one-liner for generating a simple hex byte dump:

    perl -lpe '$_=unpack("H*",$_);s/(..)/$1 /g'
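    The same dump can be produced from inside a script. This is only a sketch: the sub name hex_bytes is made up for illustration, and it assumes the string holds raw bytes (for example, data read from a file in raw mode), not decoded characters.

```perl
use strict;
use warnings;

# Return the bytes of a string as space-separated hex pairs.
# Assumption: $bytes is a byte string, not a decoded character string.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', unpack '(H2)*', $bytes;
}

# The two ASCII bytes mentioned above: "3" followed by a space.
print hex_bytes("3 "), "\n";    # 33 20
```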
Re: Problem with UTF-8 email
by zwon (Abbot) on Jul 27, 2010 at 00:23 UTC

    Cyrillic languages use exactly the same whitespace as other languages, so there is no need to convert it to &nbsp; in HTML code. So the real question is why the whitespace in your emails turns to garbage, and you have not provided us with enough information to answer that question. How are you sending these emails?
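    For the literal question (touching only the text between tags), here is a minimal sketch, not a recommendation. It assumes tags never contain a literal > inside attribute values or comments, and the sub name nbsp_outside_tags is made up for illustration. For arbitrary markup a real parser such as HTML::Parser is the safer route.

```perl
use strict;
use warnings;

# Replace plain spaces with &nbsp; in text nodes only, leaving tags untouched.
# Assumption: every tag matches <[^>]*> (no ">" inside attribute values).
sub nbsp_outside_tags {
    my ($html) = @_;

    # The capture group makes split keep the tags, so the list alternates:
    # even indices are text, odd indices are tags.
    my @parts = split /(<[^>]*>)/, $html;

    for my $i (0 .. $#parts) {
        next if $i % 2;                  # odd index: a tag, leave it alone
        $parts[$i] =~ s/ /&nbsp;/g;      # even index: text between tags
    }
    return join '', @parts;
}

print nbsp_outside_tags('<a href="x y">one two</a> three'), "\n";
# <a href="x y">one&nbsp;two</a>&nbsp;three
```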

      Probably character encoding mismatch, or MIME encoding mismatch, or both
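      As an illustration of such a mismatch (not a diagnosis of this particular mail): when text is encoded as UTF-8 but the bytes are then read as Windows-1252, a single character turns into several. The helper names below are hypothetical; encode and decode come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return the code points of a character string as "U+XXXX U+XXXX ...".
sub code_points {
    my ($chars) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
}

# Encode as UTF-8, then (wrongly) decode those bytes as Windows-1252.
sub misread_as_cp1252 {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return decode('cp1252', $bytes);
}

my $quote = "\x{201C}";    # LEFT DOUBLE QUOTATION MARK, UTF-8 bytes e2 80 9c
print code_points(misread_as_cp1252($quote)), "\n";
# U+00E2 U+20AC U+0153, i.e. a-circumflex, euro sign, oe ligature
```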
Re: Problem with UTF-8 email
by roboticus (Chancellor) on Jul 27, 2010 at 00:34 UTC

    Without seeing code, I can only speculate. However, &nbsp; is an HTML encoding of a space, not a Unicode encoding. I *rarely* use Unicode, but I suspect that you're mixing up your symbol sets, have missing or extra encode/decode calls, etc. I'd start by making sure you have a Unicode-compatible command window, then start your program in the debugger and look things over as you work your way through the program.
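    To show what an extra encode call does to the bytes, here is a small sketch (an illustration only; whether this is what happens in the original program is unknown, and hex_bytes is a made-up helper). Encoding an already-encoded string treats each byte as a Latin-1 character and encodes it again.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Space-separated hex dump of a byte string.
sub hex_bytes { return join ' ', unpack '(H2)*', $_[0] }

my $text  = "\x{41F}";                 # CYRILLIC CAPITAL LETTER PE
my $once  = encode('UTF-8', $text);    # correct: one character, two bytes
my $twice = encode('UTF-8', $once);    # extra call: each byte encoded again

print hex_bytes($once),  "\n";    # d0 9f
print hex_bytes($twice), "\n";    # c3 90 c2 9f
```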

    I'd also suggest reading the docs: perlunicode, perlunifaq, perluniintro, and/or perlunitut.

    ...roboticus

Re: Problem with UTF-8 email
by Gangabass (Vicar) on Jul 27, 2010 at 08:33 UTC

    I use this code to send Russian emails:

    use MIME::Lite;
    use MIME::Base64;
    use Encode qw(encode);    # provides encode( 'MIME-Header', ... )

    #MIME::Lite->send( 'smtp', '192.168.0.22', AuthUser => "kinjo", AuthPass => "password", Debug => 0 );
    MIME::Lite->send( 'smtp', '192.168.0.22', Debug => 0 );

    my $fr_title = encode( 'MIME-Header', $from_title );
    my $msg = MIME::Lite->new(
        From    => qq{"$fr_title" <$from_address>},
        To      => qq{<$email_address>},
        Subject => encode( 'MIME-Header', $message_subject ),
        Type    => 'multipart/related',
    );
    $msg->attach(
        Type => 'text/html; charset=UTF-8',
        Data => $message_html,
    );
    my $attachment_name = encode( 'MIME-Header', $attachment_title . ".doc" );
    #my $attachment_title_temp = $attachment_title;
    $msg->attach(
        Type     => 'application/msword',
        Path     => $base_path . '\\' . $dir_name . '\\' . $file_name . ".doc",
        Filename => $attachment_name,
    );
    $msg->attr('content-type.charset' => 'UTF-8');
    $msg->send;
Re: Problem with UTF-8 email
by afoken (Chancellor) on Jul 27, 2010 at 11:14 UTC
    so I was told I had to convert all whitespace to &nbsp;

    That is plain nonsense. &nbsp; is a non-breaking space, not the ordinary white space, and ordinary white space is used with Cyrillic text just as with any other.

    it works fine, however whitespace is converted into this weird character: “3 â€

    You clearly have an encoding problem. Show us the code sending the e-mail. Also show us the source of a mail that was generated by the code and shows the problem.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)