csthflk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I've got some code that I wrote while working on a Windows machine that does not work when used on Linux or MacOS. The key line of the code is this:
print OUT chr(charnames::vianame("$symbolName"));
When I run the program on Windows, the output file consists of correct Unicode characters. When I run the same program on Linux or MacOS, the file consists of gobbledygook. Whether I run on Windows, Mac, or Linux, the debug statements show exactly the same sort of results (as if it is doing the right thing):
Word is th;n
Writing t as GREEK SMALL LETTER TAU
Writing h; as GREEK SMALL LETTER ETA WITH VARIA
Writing n as GREEK SMALL LETTER NU

I prefer to work on MacOS or Linux rather than Windows so would like to figure out what the problem is. Below is the whole program. Thanks for any help.
use charnames ":full";
binmode(STDOUT, ":utf8");

%mapUnicode = ();
open(MAP, "map2.txt") or die "!";
while(<MAP>) {
    next if (/^#/);
    next if ($_ !~ /[A-Z]/);
    chomp;
    if (length $_ > 0) {
        my @mapInfo = split / ### /;
        $mapUnicode{"$mapInfo[0]"} = $mapInfo[2];
    }
}
close(MAP);

open IN, "greekwords1.txt" or die "!";
open OUT, ">:utf8", "greekwords2.txt";
$buffer = "";
while(<IN>) {
    chomp;
    my $word = $_;
    print "\nWord is $word\n";
    while($word =~ m/(.)/g) {
        my $newPart = $1;
        my $prospectiveUnit = "$buffer$newPart";
        if (exists $mapUnicode{$prospectiveUnit}) {
            $buffer = $prospectiveUnit;
        }
        else {
            my $symbolName = $mapUnicode{$buffer};
            print "Writing $buffer as $symbolName\n";
            print OUT chr(charnames::vianame("$symbolName"));
            $buffer = "$newPart";
        }
    }
    my $symbolName = $mapUnicode{$buffer};
    print "Writing $buffer as $symbolName\n";
    print OUT chr(charnames::vianame("$symbolName"));
    $buffer = "";
    print OUT "\n";
}
close IN;
close OUT;

Replies are listed 'Best First'.
Re: unicode issues on Unix only
by kcott (Archbishop) on Nov 07, 2013 at 09:17 UTC

    G'day csthflk,

    Firstly, here's working code (written and run on Mac OS X) that does what you want. See the Notes at the end for details of what I did differently and why.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    use charnames ':full';

    my $in_map = 'pm_unicode_1061453_map2.txt';
    my $in_words = 'pm_unicode_1061453_greekwords1.txt';
    my $out_greek = 'pm_unicode_1061453_greek_out.txt';

    my $in_map_re = qr{^([^#]+)\s###[^#]+###\s([^#]+?)\s*$};

    open my $in_map_fh, '<', $in_map;
    my %uni_map = map { /$in_map_re/ ? ($1 => $2) : () } <$in_map_fh>;
    close $in_map_fh;

    open my $in_words_fh, '<', $in_words;
    open my $out_greek_fh, '>:utf8', $out_greek;

    while (<$in_words_fh>) {
        chomp;
        my @word_chars = split '';
        my $greek_word = '';
        my $key = '';

        while (@word_chars) {
            $key .= shift @word_chars;
            next unless exists $uni_map{$key};
            next if @word_chars && exists $uni_map{join '' => $key, $word_chars[0]};
            $greek_word .= charnames::string_vianame($uni_map{$key});
            $key = '';
        }

        die "Can't find charname for '$key'" if $key;
        print $out_greek_fh "$greek_word\n";
    }

    close $in_words_fh;
    close $out_greek_fh;

    I downloaded the input files with wget. They have the same line ending discrepancy that graff noted (above).

    Here's the output. There are some issues with posting Unicode text inside <code>...</code> tags, so I've used <pre>...</pre> tags here.

    $ cat pm_unicode_1061453_greek_out.txt
    Θεωροῦντες
    δὲ
    τὴν
    τοῦ
    

    Notes:

    • Use strict and warnings in all your scripts. Turn off a limited subset of their functionality, in a limited scope, when it's unwanted and you understand what you're doing and why.
    • I've used autodie to trap I/O errors. I would recommend doing this, because it's much easier than the alternative and your script does not become littered with '... or die "Some custom message: $!";' code; if you choose not to do this, you'll need to handcraft every one of those yourself. Just looking at your open statements: you don't check whether one of them (OUT) worked at all; the other two (MAP and IN) have '... or die "!";' ('!' should be '$!' and there's no message).
    • Use lexical filehandles and the 3-argument form of open. See my code for examples and the doco for further examples and discussion.
    • map is often used to create a hash. As you can see, it uses a lot less code than your while loop. It's pretty straightforward, but ask if you don't understand some part of what I did here.
    • For generating the Unicode characters, I've used charnames::string_vianame(). This meant I didn't need an extra function (i.e. chr) to convert the code point to a string.
    • Note how I've only needed a single print statement to populate the output file. Whenever you find yourself writing the same (near) identical code, consider whether there's a better algorithm; if not, use a subroutine (one place to make mistakes, fixes, enhancements, etc.).
    • Depending on how far along you are with your project, and whether you have control of the map2.txt file, you might like to look at charnames: CUSTOM ALIASES, which would allow you to get rid of all that mapping code completely and just replace "use charnames ':full';" with "use charnames ':alias' => 'file';". It's a little more complicated than that, and it's explained in the doco.
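
To make two of the notes above concrete (building a hash with map, and string_vianame() versus chr plus vianame()), here is a minimal sketch. The sample map lines are hypothetical: the real map2.txt is not shown in this thread, so the "key ### middle ### NAME" layout and the placeholder middle field are assumptions inferred from the regex and the split in the posted code.

```perl
use strict;
use warnings;
use charnames ':full';

# Hypothetical stand-ins for lines of map2.txt, including CRLF endings.
# The middle field ("x") is a placeholder, not real data.
my @map_lines = (
    "# a comment line\r\n",
    "t ### x ### GREEK SMALL LETTER TAU\r\n",
    "n ### x ### GREEK SMALL LETTER NU\r\n",
);

# Same pattern as in the reply: capture the key and the character name,
# and let the trailing \s* swallow any CR/LF at the end of the line.
my $map_re = qr{^([^#]+)\s###[^#]+###\s([^#]+?)\s*$};

# map returns a (key => value) pair for matching lines and an empty list
# for everything else, so comments never reach the hash.
my %uni_map = map { /$map_re/ ? ($1 => $2) : () } @map_lines;

# Two routes to the same one-character string.
my $tau_direct  = charnames::string_vianame($uni_map{t});
my $tau_via_chr = chr charnames::vianame($uni_map{t});

printf "%d entries; tau is U+%04X either way: %s\n",
    scalar(keys %uni_map), ord($tau_direct),
    ($tau_direct eq $tau_via_chr ? 'yes' : 'no');
```

Note that charnames::string_vianame() needs Perl 5.14 or later; on older perls the chr plus vianame() form is the one to use.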

    -- Ken

      Thanks Ken, I appreciate the tips.
Re: unicode issues on Unix only
by graff (Chancellor) on Nov 07, 2013 at 03:19 UTC
    When I downloaded your sample data files, I noticed that "map2.txt" has CRLF line termination, while "greekwords1.txt" does not. Because of that, using chomp on osx/linux/unix doesn't do everything you want it to when you read the map file.

    Try using s/\s+$//; instead of chomp.

    (Curiously, when I first ran your script as-is on osx, with chomp, I didn't get "gobblede-gook" - I got nulls. But when I switched to removing all final white space, I got Greek.)
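
A minimal sketch of the difference described above, using a hypothetical CRLF-terminated line rather than the real map2.txt. With the default input record separator, chomp removes only the trailing "\n", so the carriage return stays attached to the character name. Perl on Windows normally reads text files through a :crlf layer that translates CRLF to LF, which would explain why the original script behaved there.

```perl
use strict;
use warnings;

# A line as it arrives on Linux or macOS from a CRLF-terminated file
# (hypothetical sample content).
my $line = "GREEK SMALL LETTER TAU\r\n";

# chomp strips the value of $/ (a lone "\n" by default) and nothing else.
my $chomped = $line;
chomp $chomped;

# Removing all trailing whitespace also takes the "\r" with it.
my $trimmed = $line;
$trimmed =~ s/\s+$//;

printf "after chomp:    %d chars, trailing CR: %s\n",
    length $chomped, ($chomped =~ /\r\z/ ? 'yes' : 'no');
printf "after s/\\s+\$//: %d chars, trailing CR: %s\n",
    length $trimmed, ($trimmed =~ /\r\z/ ? 'yes' : 'no');
```

A name that still carries the "\r" is no longer an exact character name, so a lookup such as charnames::vianame() would not be expected to find it.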

      Thanks, the line-ending issue was the cause of the problem.
Re: unicode issues on Unix only
by daxim (Curate) on Nov 06, 2013 at 17:55 UTC
    Please provide the input files so the program will run.

        You're not telling perl to treat the input file as utf8
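
For reference, a sketch of what declaring an input encoding looks like. It reads from an in-memory string of hypothetical UTF-8 octets rather than a real file so that it is self-contained; with a file you would pass the filename in place of the scalar reference. The thread attributes the actual failure to line endings, so this only illustrates the general technique.

```perl
use strict;
use warnings;

# UTF-8 octets for GREEK SMALL LETTER TAU plus a newline, standing in for
# the contents of an input file (hypothetical data).
my $octets = "\xCF\x84\n";

# Without an encoding layer, perl hands back the raw octets.
open my $raw_fh, '<', \$octets or die "Cannot open in-memory file: $!";
my $raw = <$raw_fh>;
close $raw_fh;

# With :encoding(UTF-8), the octets are decoded into characters on read.
open my $utf8_fh, '<:encoding(UTF-8)', \$octets
    or die "Cannot open in-memory file: $!";
my $decoded = <$utf8_fh>;
close $utf8_fh;

chomp($raw, $decoded);

printf "raw: %d chars, decoded: %d char (U+%04X)\n",
    length $raw, length $decoded, ord $decoded;
```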

        The posting system here keeps mangling the map file, no matter what conventions I use to post it

        perl -e " use Data::Dump; use Path::Tiny; dd( path( shift )->slurp_raw ) while @ARGV " file1.file file2.file