cbingel has asked for the wisdom of the Perl Monks concerning the following question:

I can't seem to wrap my brain around how Perl translates Unicode. I'm trying to gather SMTP addresses from a source and insert them into a Mailsweeper config file. Mailsweeper, however, seems to store its config files in Unicode rather than ASCII. Therefore, when the following script is run, all of the data appears to be there, but there's a blank space between every character, except for the portion that I inserted. Help is much appreciated.

    $filesenders = "z:/train/whitesenders.txt";
    $fileaddrlist = "z:/Program Files/Mailsweeper for SMTP/Config/shared/addrlist.test";

    #Read and arrayify list of SMTP addresses
    open SENDERSREAD, $filesenders or die "unable to open sender file";
    foreach $newsender (<SENDERSREAD>) {
        chomp $newsender;
        push @senders, $newsender;
    }

    #Open MailSweeper Addrlist.cfg and pull address lists into a hash of arrays
    open (CONFIGREAD, "<:utf8", $fileaddrlist) or die "UNable to open config file for read";
    print "Reading Config File...\n";
    foreach (<CONFIGREAD>) {
        chomp;
        if (/\[(.*)\]/) {
            $currhead = $1;
            print "$currhead\n";
            next;
        }
        next if (!(/./)); #Check for and omit blank lines
        push @{$config{$currhead}}, $_;
    }
    close CONFIGREAD;

    #Put senders captured from file into appropriate list
    foreach (@senders) {
        push @{$config{'AddressList\Scripted Whitelist'}}, "v:Member=\$S\"$_\"";
    }

    #write updated config to file, overwriting existing file
    open (CONFIGWRITE, ">:encoding(UTF-8)", $fileaddrlist) or die "unable to open config file for write";
    foreach (keys %config) {
        print CONFIGWRITE "[$_]\n";
        foreach (@{$config{$_}}) {
            print CONFIGWRITE $_;
            print CONFIGWRITE "\n";
        }
        print CONFIGWRITE "\n";
    }
    close CONFIGWRITE;

=====
Or, to make it even simpler, how do I get this snippet to print out a copy of the config file:

    open (READ, "z:/test.txt") or die "unable to open file";
    foreach (<READ>) {
        push @array, $_;
    }
    foreach (@array) {
        print $_;
    }

If requested, I can make a sample of the source file available.

=====
OK, further work, further progress. I've finally determined that if I read the file in with UTF-16LE encoding, I get the strings in usable form. But if I output the file using :encoding(UTF-16LE), I'm back to having a space between every character. Is this because Perl uses UTF-8 internally? If so, do I maybe have to use Unicode::String to convert the strings before sending them back to the filehandle?

Replies are listed 'Best First'.
Re: Reading and writing to unicode files
by paulbort (Hermit) on Mar 04, 2004 at 18:53 UTC
    Why did you write one open as "<:utf8", but the other as ">:encoding(UTF-8)"? I don't see ":encoding" in the 3rd edition Camel, but it admits to being incomplete in this regard (p.754).

    If there's a null (0x0) before each otherwise normal ASCII character, that sounds more like UTF-16 than UTF-8. See the difference explained in a post on w3.org.
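To make the null-byte point concrete, here is a minimal sketch that encodes a short ASCII string both ways and dumps the bytes as hex. The sample string "Hi" is just an illustration and does not come from the MailSweeper file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample text chosen for illustration only.
my $text = "Hi";

# Encode to raw bytes, then format each byte as two hex digits.
my $utf8_hex  = join ' ', unpack '(H2)*', encode('UTF-8',    $text);
my $utf16_hex = join ' ', unpack '(H2)*', encode('UTF-16LE', $text);

print "UTF-8:    $utf8_hex\n";    # 48 69
print "UTF-16LE: $utf16_hex\n";   # 48 00 69 00
```

A viewer that treats the UTF-16LE bytes as single-byte text shows each 00 as a gap, which matches the "space between every character" symptom.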

    So I would recommend trying ">:utf16", and see if you get what you want. Good Luck!

    --
    Spring: Forces, Coiled Again!
      I tried :utf16, but it wouldn't open the file. Do I need to turn that on somewhere?
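As far as I know, ":utf16" is not a PerlIO layer name, which would explain the failed open; the encoding name goes inside ":encoding(...)" instead. The sketch below contrasts the two. The temp file is a hypothetical stand-in for the real config file, and the choice of UTF-16LE is an assumption based on the original poster's later update.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file standing in for the real config file.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# ":utf16" is not a known layer name, so this open is expected to fail.
my $bad_ok = do { no warnings 'layer'; open(my $bad, '>:utf16', $path) };

# The encoding name goes inside :encoding(...). The leading :raw keeps any
# platform CRLF layer from rewriting bytes underneath the encoder.
my $good_ok = open(my $out, '>:raw:encoding(UTF-16LE)', $path);
die "cannot write $path: $!" unless $good_ok;
print {$out} "[AddressList]\n";
close $out;

print "':utf16' open ", ($bad_ok ? "succeeded" : "failed"), "\n";
print "file size in bytes: ", -s $path, "\n";
```

The written file should be twice as many bytes as characters, since every character here takes two bytes in UTF-16LE.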
Re: Reading and writing to unicode files
by tbone1 (Monsignor) on Mar 04, 2004 at 18:07 UTC
    Hm, the only thing I can see that MIGHT be an issue is in the chomps. From the values assigned to $filesenders and $fileaddrlist in the first two lines, my guess is that you are on a Windows system. Windows uses two characters for its 'end of line character', these being "\r\n" IIRC, whereas Unix uses one, "\n". I am unfamiliar with the Windows implementations of Perl, but I wonder if the chomps are taking off both characters, or just the \n, and if those 'spaces' are actually return characters, aka \r.
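On the chomp question: with the default $/, chomp removes only a trailing "\n". Whether a "\r" is still present depends on whether a CRLF-translating layer was active when the line was read. The sketch below assumes a line that arrived with "\r\n" intact; the address is made up.

```perl
use strict;
use warnings;

# A line as it would look if read without CRLF translation (made-up address).
my $line = "user\@example.com\r\n";

# chomp strips the trailing "\n" only, leaving the "\r" behind.
my $chomped = $line;
chomp $chomped;

# An explicit substitution drops an optional "\r" along with the "\n".
(my $cleaned = $line) =~ s/\r?\n\z//;

printf "%d %d\n", length($chomped), length($cleaned);   # 17 16
```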

    Just a thought.

    --
    tbone1, YAPS (Yet Another Perl Schlub)
    And remember, if he succeeds, so what.
    - Chick McGee

Re: Reading and writing to unicode files
by iburrell (Chaplain) on Mar 04, 2004 at 21:06 UTC
    The important question is how the MailSweeper files are encoded. "Unicode" is not an answer to that question because there are multiple, completely different encodings of Unicode characters as bytes. Perl uses UTF-8 internally for Unicode strings. UTF-16 stores most Unicode characters as two bytes (characters outside the Basic Multilingual Plane take four). Windows commonly uses UTF-16 for "Unicode" text files.

    If you are getting an extra byte between ASCII characters, then the file is most likely using UTF-16. Try this:

    open (CONFIGREAD, "<:encoding(UTF-16)", $fileaddrlist)
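A fuller round-trip sketch follows, under several assumptions that are not confirmed anywhere in this thread: that the file is UTF-16LE, that it starts with a BOM, and that it uses CRLF line endings. The temp file and its two sample lines are hypothetical stand-ins for the real config. Note that ":encoding(UTF-16)" without an explicit byte order relies on a BOM when reading, so naming the byte order and handling the BOM by hand is one way to keep both directions symmetric.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for addrlist.test, seeded with made-up content.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

open(my $seed, '>:raw:encoding(UTF-16LE)', $path) or die "seed: $!";
print {$seed} "\x{FEFF}[AddressList\\Scripted Whitelist]\r\n";
print {$seed} "v:Member=\$S\"a\@example.com\"\r\n";
close $seed;

# Read: decode UTF-16LE, then drop the BOM and the line endings.
open(my $in, '<:raw:encoding(UTF-16LE)', $path) or die "read: $!";
my @lines = <$in>;
close $in;
for my $line (@lines) {
    $line =~ s/^\x{FEFF}//;
    $line =~ s/\r?\n\z//;
}

# Add one more (made-up) whitelist entry.
push @lines, 'v:Member=$S"b@example.com"';

# Write: same layer stack, BOM first, CRLF put back explicitly.
open(my $out, '>:raw:encoding(UTF-16LE)', $path) or die "write: $!";
print {$out} "\x{FEFF}";
print {$out} "$_\r\n" for @lines;
close $out;
```

The leading ":raw" matters on Windows: without it, a CRLF layer underneath the encoder may turn the single 0A byte of an encoded newline into 0D 0A, which shifts every following two-byte unit by one byte. That is one plausible cause of garbled UTF-16 output, though I have not verified it against the original poster's setup.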
      Hmm, I've been pondering this question myself: is there a utility to detect which Unicode encoding a file uses? How about the Unix "file" or "type" utilities? Is there anything in Perl to detect the "magic" of a file?

        You should be able to do it by looking at the first two bytes to detect UCS2/UTF-16.

        A UCS2 file would normally have a Byte Order Mark (BOM), which is *either* 0xFF 0xFE (little-endian) or 0xFE 0xFF (big-endian), depending on the byte order of the system that wrote the file.
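The BOM check can be sketched as a small function over the leading bytes of a file. The name sniff_bom is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM also begins with FF FE.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding from a leading BOM.
# Returns an encoding name, or undef when no known BOM is present.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return;
}

# Byte strings built by hand for illustration.
for my $sample ("\xFF\xFE[\x00", "\xFE\xFF\x00[", "[plain]") {
    my $guess = sniff_bom($sample);
    print defined $guess ? $guess : 'no BOM', "\n";
}
```

For a real file, the bytes would come from a handle opened with "<:raw" and a short read of the first few bytes.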

        If there is no BOM, you fall into the realm of multibyte encodings. In the parsers I've designed, where there is no other metadata (like an XML prolog, or some encoding information from HTTP), my approach is to first attempt to parse as UTF-8 (it's dead handy that the conversion is relatively trivial); if that fails (that is, it finds an invalid UTF-8 byte sequence), I fall back to the current locale.
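A minimal sketch of that try-UTF-8-then-fall-back idea, using the Encode module. The function name decode_guess is hypothetical, and the fallback encoding is passed in as a parameter rather than read from the locale, which is a simplification of what the post describes.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: strict UTF-8 decode first, then a caller-supplied
# fallback encoding if the bytes are not valid UTF-8.
sub decode_guess {
    my ($bytes, $fallback) = @_;

    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps
    # $bytes intact so it can be reused for the fallback.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text if defined $text;

    return decode($fallback, $bytes);
}

# The same word as valid UTF-8 bytes and as Latin-1 bytes.
my $from_utf8   = decode_guess("caf\xC3\xA9", 'ISO-8859-1');
my $from_latin1 = decode_guess("caf\xE9",     'ISO-8859-1');

print $from_utf8 eq $from_latin1 ? "same text\n" : "different text\n";
```

The ordering matters: almost any byte sequence is valid Latin-1, so trying the single-byte encoding first would never fail and UTF-8 input would be misread.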

        * Actually, Windows boxen use UCS-2, not UTF-16. Strictly speaking, UTF-16 is a superset of UCS-2, and UTF-16 is binary backward-compatible with it.
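The practical difference between the two shows up only for characters outside the Basic Multilingual Plane, which UTF-16 encodes as a surrogate pair. A small sketch, with U+1D11E (a musical symbol) picked as an arbitrary example:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One BMP character and one character beyond U+FFFF (arbitrary examples).
my $bmp    = encode('UTF-16LE', "A");
my $astral = encode('UTF-16LE', "\x{1D11E}");

print length($bmp), " ", length($astral), "\n";      # 2 4
print join(' ', unpack '(H2)*', $astral), "\n";      # 34 d8 1e dd
```

For text that stays inside the BMP, such as plain SMTP addresses, the UCS-2 and UTF-16 byte sequences are identical.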