in reply to unicode issues on Unix only

G'day csthflk,

Firstly, here's working code (written and run on Mac OS X) that does what you want. See the Notes at the end for details of what I did differently and why.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use charnames ':full';

my $in_map    = 'pm_unicode_1061453_map2.txt';
my $in_words  = 'pm_unicode_1061453_greekwords1.txt';
my $out_greek = 'pm_unicode_1061453_greek_out.txt';

my $in_map_re = qr{^([^#]+)\s###[^#]+###\s([^#]+?)\s*$};

open my $in_map_fh, '<', $in_map;
my %uni_map = map { /$in_map_re/ ? ($1 => $2) : () } <$in_map_fh>;
close $in_map_fh;

open my $in_words_fh, '<', $in_words;
open my $out_greek_fh, '>:utf8', $out_greek;

while (<$in_words_fh>) {
    chomp;
    my @word_chars = split '';
    my $greek_word = '';
    my $key = '';

    while (@word_chars) {
        $key .= shift @word_chars;
        next unless exists $uni_map{$key};
        next if @word_chars && exists $uni_map{join '' => $key, $word_chars[0]};
        $greek_word .= charnames::string_vianame($uni_map{$key});
        $key = '';
    }

    die "Can't find charname for '$key'" if $key;
    print $out_greek_fh "$greek_word\n";
}

close $in_words_fh;
close $out_greek_fh;
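To see the inner loop's lookahead in isolation, here is a minimal sketch that uses a small in-memory map instead of the map file. The three map entries and the to_greek name are hypothetical, chosen only for illustration; they are not taken from the real map. The point is that a key is only converted once appending the next character would no longer give a valid key, so 'a)' wins over 'a'.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use charnames ':full';

# Hypothetical map (not from the real map file):
# transliteration key => Unicode character name.
my %uni_map = (
    'a'  => 'GREEK SMALL LETTER ALPHA',
    'a)' => 'GREEK SMALL LETTER ALPHA WITH PSILI',
    'b'  => 'GREEK SMALL LETTER BETA',
);

# Hypothetical helper wrapping the same loop as the script above.
sub to_greek {
    my ($word) = @_;
    my @chars = split //, $word;
    my ($greek, $key) = ('', '');

    while (@chars) {
        $key .= shift @chars;

        # Keep accumulating until the key is known ...
        next unless exists $uni_map{$key};

        # ... and keep going if one more character still gives a known key.
        next if @chars && exists $uni_map{ $key . $chars[0] };

        $greek .= charnames::string_vianame($uni_map{$key});
        $key = '';
    }

    die "Can't find charname for '$key'" if length $key;
    return $greek;
}

binmode STDOUT, ':encoding(UTF-8)';
print to_greek('a)ba'), "\n";
```

Note that charnames::string_vianame needs Perl 5.14 or later. The sketch tests length $key rather than $key, so a leftover key of '0' would also be reported.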

I downloaded the input files with wget. They have the same line ending discrepancy that graff noted (above).
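If the discrepancy is a mix of CRLF and LF line endings (an assumption on my part, as the details aren't repeated here), chomp alone will leave a stray carriage return on some lines. A rough sketch of one way to strip either kind of ending follows; strip_eol is a hypothetical helper name and \R needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing line ending of any flavour.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\R\z//;    # \R matches CRLF as a unit, as well as LF or CR
    return $line;
}

# The same text with three different endings, plus one with none.
for my $raw ("word\r\n", "word\n", "word\r", "word") {
    my $clean = strip_eol($raw);
    printf "%d chars after stripping\n", length $clean;
}
```

In the script's read loop, this would take the place of the plain chomp.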

Here's the output. There are some issues with posting Unicode inside <code>...</code> tags, so I've used <pre>...</pre> tags here.

$ cat pm_unicode_1061453_greek_out.txt
Θεωροῦντες
δὲ
τὴν
τοῦ

Notes:

-- Ken

Re^2: unicode issues on Unix only
by csthflk (Novice) on Nov 07, 2013 at 19:00 UTC
    Thanks Ken, I appreciate the tips.