unicode issues on Unix only

csthflk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've got some code that I wrote while working on a Windows machine that does not work when used on Linux or MacOS. The key line of the code is this:

print OUT chr(charnames::vianame("$symbolName"));

When I run the program on Windows, the output file consists of correct Unicode characters. When I run the same program on Linux or MacOS, the file consists of gobbledy-gook. Whether I run on Windows, Mac, or Linux, the debug statements show exactly the same sort of results (as if it is doing the right thing):
Word is th;n
Writing t as GREEK SMALL LETTER TAU
Writing h; as GREEK SMALL LETTER ETA WITH VARIA
Writing n as GREEK SMALL LETTER NU

I prefer to work on MacOS or Linux rather than Windows so would like to figure out what the problem is. Below is the whole program. Thanks for any help.

use charnames ":full";
binmode(STDOUT, ":utf8");

%mapUnicode = ();
open(MAP, "map2.txt") or die "!";
while(<MAP>) {
    next if (/^#/);
    next if ($_ !~ /[A-Z]/);
    chomp;
    if (length $_ > 0) {
        my @mapInfo = split / ### /;
        $mapUnicode{"$mapInfo[0]"} = $mapInfo[2];
    }
}
close(MAP);

open IN, "greekwords1.txt" or die "!";
open OUT, ">:utf8", "greekwords2.txt";
$buffer = "";
while(<IN>) {
    chomp;
    my $word = $_;
    print "\nWord is $word\n";
    while($word =~ m/(.)/g) {
        my $newPart = $1;
        my $prospectiveUnit = "$buffer$newPart";
        if (exists $mapUnicode{$prospectiveUnit}) {
            $buffer = $prospectiveUnit;
        }
        else {
            my $symbolName = $mapUnicode{$buffer};
            print "Writing $buffer as $symbolName\n";
            print OUT chr(charnames::vianame("$symbolName"));
            $buffer = "$newPart";
        }
    }
    my $symbolName = $mapUnicode{$buffer};
    print "Writing $buffer as $symbolName\n";
    print OUT chr(charnames::vianame("$symbolName"));
    $buffer = "";
    print OUT "\n";
}
close IN;
close OUT;

Replies are listed 'Best First'.
Re: unicode issues on Unix only by kcott (Archbishop) on Nov 07, 2013 at 09:17 UTC
G'day csthflk,

Firstly, here's working code (written and run on Mac OS X) that does what you want. See the Notes at the end for details of what I did differently and why.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use charnames ':full';

my $in_map    = 'pm_unicode_1061453_map2.txt';
my $in_words  = 'pm_unicode_1061453_greekwords1.txt';
my $out_greek = 'pm_unicode_1061453_greek_out.txt';

my $in_map_re = qr{^([^#]+)\s###[^#]+###\s([^#]+?)\s$};

open my $in_map_fh, '<', $in_map;
my %uni_map = map { /$in_map_re/ ? ($1 => $2) : () } <$in_map_fh>;
close $in_map_fh;

open my $in_words_fh, '<', $in_words;
open my $out_greek_fh, '>:utf8', $out_greek;

while (<$in_words_fh>) {
    chomp;
    my @word_chars = split '';
    my $greek_word = '';
    my $key = '';

    while (@word_chars) {
        $key .= shift @word_chars;
        next unless exists $uni_map{$key};
        next if @word_chars && exists $uni_map{join '' => $key, $word_chars[0]};
        $greek_word .= charnames::string_vianame($uni_map{$key});
        $key = '';
    }

    die "Can't find charname for '$key'" if $key;

    print $out_greek_fh "$greek_word\n";
}

close $in_words_fh;
close $out_greek_fh;

I downloaded the input files with `wget`. They have the same line ending discrepancy that graff noted elsewhere in this thread.

Here's the output. There are some issues with posting Unicode code with `<code>...</code>` tags; I've used `<pre>...</pre>` tags here.

$ cat pm_unicode_1061453_greek_out.txt
Θεωροῦντες
δὲ
τὴν
τοῦ

Notes:

* Use strict and warnings in all your scripts. Turn off a limited subset of their functionality, in a limited scope, when it's unwanted and you understand what you're doing and why.
* I've used autodie to trap I/O errors. I would recommend doing this, because it's much easier than the alternative and your script does not become littered with "`... or die "Some custom message: $!";`" code; if you choose not to do this, you'll need to handcraft every one of those yourself. Just looking at your `open` statements: you don't check whether one of them (`OUT`) worked at all; the other two (`MAP` and `IN`) have "`... or die "!";`" ('`!`' should be '`$!`' and there's no message).
* Use lexical filehandles and the 3-argument form of open. See my code for examples and the doco for further examples and discussion.
* map is often used to create a hash. As you can see, it uses a lot less code than your `while` loop. It's pretty straightforward, but ask if you don't understand some part of what I did here.
* For generating the Unicode characters, I've used charnames::string_vianame(). This meant I didn't need an extra function (i.e. chr) to convert the code point to a string.
* Note how I've only needed a single print statement to populate the output file. Whenever you find yourself writing the same (or nearly identical) code more than once, consider whether there's a better algorithm; if not, use a subroutine (one place to make mistakes, fixes, enhancements, etc.).
* Depending on how far along you are with your project, and whether you have control of the `map2.txt` file, you might like to look at charnames: CUSTOM ALIASES which would allow you to get rid of all that mapping code completely and just replace "`use charnames ':full';`" with "`use charnames ':alias' => 'file';`". It's a little more complicated than that and explained in the doco.

-- Ken
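A few of the notes above (lexical filehandle with 3-argument open, map building a hash, string_vianame() versus chr plus vianame()) can be tried in isolation with a sketch like the following. It is only an illustration under assumptions: the map text is invented (the keys and the middle field are placeholders, not the contents of `map2.txt`), it reads from an in-memory string instead of a file, and the regex is a variant written for this sketch, not the one in Ken's code.

```perl
use strict;
use warnings;
use charnames ':full';

# Stand-in for the map file: three fields separated by " ### ".
# The keys and the middle field are invented for this sketch.
my $map_text = join '',
    "# a comment line\n",
    "t ### x ### GREEK SMALL LETTER TAU\r\n",
    "n ### x ### GREEK SMALL LETTER NU\r\n";

# Lexical filehandle, 3-argument open, and $! in the error message.
# Opening a scalar reference reads from memory, which keeps this self-contained.
open my $map_fh, '<', \$map_text
    or die "Can't open in-memory map: $!";

# Build the hash in one statement; non-matching lines contribute nothing.
# \s*\z absorbs either "\n" or "\r\n" at the end of the line.
my $map_re  = qr{^([^#\s]+) ### [^#]+ ### ([^#]+?)\s*\z};
my %uni_map = map { /$map_re/ ? ($1 => $2) : () } <$map_fh>;
close $map_fh;

# string_vianame() returns the character itself; vianame() returns the
# code point, which then needs chr().
my $tau_string = charnames::string_vianame($uni_map{t});
my $tau_chr    = chr(charnames::vianame($uni_map{t}));

binmode STDOUT, ':encoding(UTF-8)';
print "$_ => $uni_map{$_}\n" for sort keys %uni_map;
print "same character: ", ($tau_string eq $tau_chr ? 'yes' : 'no'), "\n";
```

Because the pattern swallows trailing whitespace before the end of the line, the names stored in the hash carry no stray carriage return, whichever line ending the file uses.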
Re^2: unicode issues on Unix only by csthflk (Novice) on Nov 07, 2013 at 19:00 UTC
Thanks Ken, I appreciate the tips.
Re: unicode issues on Unix only by graff (Chancellor) on Nov 07, 2013 at 03:19 UTC
When I downloaded your sample data files, I noticed that "map2.txt" has CRLF line termination, while "greekwords1.txt" does not. Because of that, using chomp on osx/linux/unix doesn't do everything you want it to when you read the map file: it removes the line feed but leaves the carriage return on the end of each line. Try using `s/\s+$//;` instead of chomp. (Curiously, when I first ran your script as-is on osx, with chomp, I didn't get "gobbledy-gook" - I got nulls. But when I switched to removing all final white space, I got Greek.)
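A minimal sketch of the difference between chomp and `s/\s+$//` on a CRLF-terminated line. The sample line is made up (only the ` ### ` layout follows the original script), and it assumes the default input record separator of "\n".

```perl
use strict;
use warnings;

# One line as read on Linux/macOS from a CRLF-terminated file.
# The key and middle field are invented; only the layout follows the script.
my $line = "t ### x ### GREEK SMALL LETTER TAU\r\n";

my $chomped = $line;
chomp $chomped;                 # removes "\n" only, so "\r" stays

my $stripped = $line;
$stripped =~ s/\s+$//;          # removes "\r\n" and any other trailing whitespace

# Same field extraction as in the original script.
my $name_chomped  = (split / ### /, $chomped)[2];
my $name_stripped = (split / ### /, $stripped)[2];

for my $name ($name_chomped, $name_stripped) {
    (my $shown = $name) =~ s/\r/\\r/g;    # make the carriage return visible
    printf "%-28s length %d\n", "'$shown'", length $name;
}
```

The chomped name still ends in a carriage return, so it is no longer exactly the character name stored in the map, which would fit the failed lookups described above.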
Re^2: unicode issues on Unix only by Anonymous Monk on Nov 07, 2013 at 18:58 UTC
Thanks, the line-ending issue was the cause of the problem.
Re: unicode issues on Unix only by daxim (Curate) on Nov 06, 2013 at 17:55 UTC
Please provide the input files so the program will run.
Re^2: unicode issues on Unix only by csthflk (Novice) on Nov 06, 2013 at 18:37 UTC
Hi daxim, The posting system here keeps mangling the map file, no matter what conventions I use to post it. Please try downloading at:
http://www.perkinscentral.com/greekwords1.txt
http://www.perkinscentral.com/map2.txt
Thanks.
Re^3: unicode issues on Unix only by Anonymous Monk on Nov 06, 2013 at 20:50 UTC
You're not telling perl to treat the input file as utf8.

"The posting system here keeps mangling the map file, no matter what conventions I use to post it"

perl -e " use Data::Dump; use Path::Tiny; dd( path( shift )->slurp_raw ) while @ARGV " file1.file file2.file
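The one-liner above relies on the CPAN modules Data::Dump and Path::Tiny. As a rough core-only alternative for the same kind of inspection, the sketch below reads each file named on the command line as raw bytes and prints them with anything outside printable ASCII shown as a `\xNN` escape. The helper names `visible_bytes` and `slurp_raw_file` are hypothetical, introduced only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: show control and non-ASCII bytes as \xNN escapes.
sub visible_bytes {
    my ($bytes) = @_;
    (my $shown = $bytes) =~ s/([^\x20-\x7e])/sprintf '\\x%02x', ord $1/ge;
    return $shown;
}

# Hypothetical helper: read a whole file as raw bytes, with no
# line-ending or encoding translation applied by PerlIO.
sub slurp_raw_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path
        or die "Can't open '$path': $!";
    local $/;                   # slurp mode: read everything at once
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Dump every file given on the command line.
print "$_: ", visible_bytes(slurp_raw_file($_)), "\n" for @ARGV;
```

Run against a CRLF-terminated file, each line ending shows up as `\x0d\x0a`, which makes a stray carriage return easy to spot.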
Re^4: unicode issues on Unix only by csthflk (Novice) on Nov 06, 2013 at 21:47 UTC