
Your string is (HTML) entity-encoded, which you would also have seen by printing it. Either the webmail hoster is doing this, or your program is sending it that way. I really wonder why you are trying to send email without knowing what encoding your text is in, and why you are scraping the text content from some external web site.

So, your first step would be to make sure you know what encoding your subject string is in. If it is HTML entities, then HTML::Entities::decode can turn it into a Perl character string, which you can then encode as UTF-8 and Base64-encode for the MIME subject header. In practice, though, it would be much better to eliminate the HTML part of the equation and set the subject yourself to some well-known characters in a well-known encoding, for example by using:

my $subject = 'Hello World';

except using Cyrillic characters, potentially in UTF-8 or KOI8-R:

use utf8; my $subject = 'Hello World'; # except in cyrillic

or

use encoding 'koi8-r'; my $subject = 'Hello World'; # except in cyrillic

if your source code editor supports KOI8-R.
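The entity-decoding route described above could be sketched roughly as follows. This is only a minimal sketch: the helper name mime_subject_from_entities and the sample input are made up for illustration, and it assumes the subject is short enough to fit into a single RFC 2047 encoded-word.

```perl
use strict;
use warnings;
use HTML::Entities ();
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: turn an entity-encoded subject into an RFC 2047
# encoded-word (UTF-8, Base64). Assumes the result fits into a single
# encoded-word (at most 75 characters); longer subjects need splitting.
sub mime_subject_from_entities {
    my ($entity_text) = @_;

    # Entities -> Perl character string (not yet bytes)
    my $chars = HTML::Entities::decode_entities($entity_text);

    # Characters -> UTF-8 octets
    my $bytes = encode('UTF-8', $chars);

    # The empty second argument suppresses the trailing newline
    return '=?UTF-8?B?' . encode_base64($bytes, '') . '?=';
}

# Sample input: the Russian word for "hello" written as numeric entities.
my $subject = mime_subject_from_entities('&#1055;&#1088;&#1080;&#1074;&#1077;&#1090;');
print "Subject: $subject\n";
```

Encode also ships a 'MIME-Header' encoding that builds such headers and handles splitting, which is probably the better choice for real mail.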

Re^8: Problem with russian / cyrillic in e-mail program.
by dbmathis (Scribe) on Apr 04, 2010 at 21:24 UTC

    I was scraping from that site because I needed Russian text. Someone was complaining that about 1% of the e-mails being sent out were not right when Russian characters were entered into the form. I already tried the print (with non-Cyrillic text), and the program itself is not introducing HTML entities. I have also been testing in Thunderbird and Gmail. Both end up with the odd subjects.

    From the sound of it, I need to figure out which charset people are using every time. I guess it could be any charset, considering that people around the globe use the form.

    I will just start debugging more and maybe I will find something. Thanks for your help, and if you can think of anything else let me know.

      So, to recap, you don't control what character sets you are getting from your HTML form submission. This has very little to do with the way you're sending mail, and everything to do with the fact that HTTP does not specify the character set in which form data is sent. The usual approach to solving that problem is to add an explicit encoding to every HTML page containing a form and hope that the browser will use that encoding when sending the data. Some browsers also send a header indicating the character set.
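
      On the receiving side, one way to cope is to treat the submitted value as raw bytes, try the encoding the page declared, and only then fall back to a guess. The following is a rough sketch, not a definitive fix: decode_form_value is a made-up name, and it assumes the form page was served as UTF-8 and that the caller picks a plausible legacy fallback such as cp1251 or koi8-r.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode raw form bytes into a Perl character string.
# Strict UTF-8 is tried first (assumption: the form page declared UTF-8);
# if the bytes are not valid UTF-8, the caller-supplied fallback encoding
# is used instead. The fallback is a guess and cannot be verified.
sub decode_form_value {
    my ($bytes, $fallback) = @_;

    # FB_CROAK makes decode die on malformed input;
    # LEAVE_SRC keeps $bytes intact for the fallback attempt.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $chars if defined $chars;

    return decode($fallback, $bytes);
}

# Usage with the same Cyrillic letter in two encodings:
my $from_utf8   = decode_form_value("\xD0\x9F", 'cp1251');  # valid UTF-8
my $from_cp1251 = decode_form_value("\xCF",     'cp1251');  # not valid UTF-8
```

      Once the value is a character string, it can be encoded as UTF-8 for the mail regardless of what the browser sent.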