UTF8 issues

kemuri has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody!

I wrote a shell application in C for Linux which conjugates Japanese verbs using string functions.

Wait wait! It's about perl, I swear.

I'd like to implement a function that converts hiragana into romaji, I mean,　さむらい　ー[function]ー＞　samurai

As it's a hard job for my laziness, I thought someone would have programmed this before, and I was obviously right. But the one I liked was written in Perl. One day I'll do it in C, but meanwhile I need your help!

This is the project: Lingua::JA::Moji
And the function I'm interested in is: kana2romaji

Okay, that's the end of the introduction. It's been two days since I started flirting with Perl, but for now I can't afford to learn enough to solve this without help.

What I want:
1. Read a text file of n lines, each containing one hiragana word
2. Convert each word with kana2romaji
3. Write the result to another file in the same layout, but in romaji

The code I have:

use Lingua::JA::Moji qw/kana2romaji romaji2kana/;
#use utf8;
#use Encode;

if( $#ARGV < 1 ){
  die("Not enough arguments\n");
}
open(INP, "<$ARGV[0]") or die("Cannot open file '$ARGV[0]' for reading\n");
open(OUTP, ">$ARGV[1]") or die("Cannot open file '$ARGV[1]' for writing\n");
my @hira = <INP>;
print OUTP kana2romaji (@hira);
close INP;
close OUTP;

But kana2romaji complains that the input is not Unicode; more precisely, it complains that the Unicode flag is off. I found some possible fixes, but I haven't managed to make any of them work:

Using Encode library

use Encode;
@hira = Encode::decode( 'utf8', @hira );#if
encode('utf8', decode('utf8', @hira), @hira));
@hira = decode("utf-8", @hira );
binmode @hira, ':encoding(utf8)';

Using utf8 library

use utf8;
_utf8_on(@hira);

I more or less know the differences between these functions, but I get lost. The input file was created with gedit, so I guess it is Unicode-encoded... I got the ideas mainly from the Encode documentation.

The program complains:
"Input is not flagged as unicode: conversion will fail. at file.pl line x"

I should pass some parameters to kana2romaji in order to get an injective function (I mean word to word, and not word to x possible words), but that's a detail.
I hope you will help me with my flash trip to Perl; I'm sure that if I succeed I'll come back later :)
Thank you guys!


Re: UTF8 issues by choroba (Cardinal) on Nov 14, 2010 at 23:10 UTC
Try `open(INP, '<:utf8',$ARGV[0]) or die...`

Update: See the documentation for the open function and for the open pragma.
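To illustrate what the layer in the suggestion above changes, here is a minimal sketch using only core Perl. The helper `read_lines` is hypothetical, and an in-memory file stands in for the hiragana file so the example is self-contained. It uses `:encoding(UTF-8)` rather than `:utf8`; as far as I know the former also validates the input, but either one decodes the bytes into characters.

```perl
use strict;
use warnings;

# UTF-8 bytes of "さむらい" plus a newline, standing in for the input file.
my $file_bytes = "\xE3\x81\x95\xE3\x82\x80\xE3\x82\x89\xE3\x81\x84\n";

# Hypothetical helper: read all lines through the given PerlIO layer.
sub read_lines {
    my ($layer, $data) = @_;
    open(my $fh, "<$layer", \$data)
        or die "Cannot open in-memory file: $!\n";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;
    return @lines;
}

my ($raw)     = read_lines('',                 $file_bytes);
my ($decoded) = read_lines(':encoding(UTF-8)', $file_bytes);

# Without a layer the word is 12 bytes; with the layer it is 4 characters.
printf "no layer:   length %2d, UTF8 flag %s\n",
    length($raw), utf8::is_utf8($raw) ? 'on' : 'off';
printf "with layer: length %2d, UTF8 flag %s\n",
    length($decoded), utf8::is_utf8($decoded) ? 'on' : 'off';
```

Only the second string is the kind of input that kana2romaji's check accepts, which is why opening the file with a layer makes the warning go away.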
Re: UTF8 issues by ikegami (Patriarch) on Nov 15, 2010 at 00:34 UTC
`decode` works, except it takes a scalar.

my @encoded_hira = <INP>;
my @decoded_hira = map { decode('UTF-8', $_) } @encoded_hira;
my @decoded_romaji = kana2romaji(@decoded_hira);
my @encoded_romaji = map { encode('UTF-8', $_) } @decoded_romaji;
print OUTP @encoded_romaji;

`binmode` works, except it takes a file handle.

binmode INP, ':encoding(UTF-8)';
binmode OUTP, ':encoding(UTF-8)';

But the simplest way is to pass that directive to `open`.

open(INP, '<:encoding(UTF-8)', $ARGV[0]);
open(OUTP, '>:encoding(UTF-8)', $ARGV[1]);

`use utf8;` tells Perl the Perl source is UTF-8. Not relevant here.

`_utf8_on` is an unsafe version of `decode`. Like `decode`, it would have worked if you had used it properly.

(Note that the module is wrong to check for the UTF8 flag. Informally, this is called "The Unicode Bug". It's trying to detect if you made an error, but it can incorrectly flag valid inputs as errors.)
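The points in this reply can be checked without Lingua::JA::Moji installed. The following is a rough sketch using only core modules; the byte strings and variable names are made up for illustration. It shows `decode` applied per element with `map`, the matching `encode` on the way out, and why the UTF8 flag is a poor validity test: the flag reflects internal storage, so two equal strings can differ in it.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Two lines of UTF-8 bytes ("さむ" and "らい"), as read from an undecoded handle.
my @encoded_lines = ("\xE3\x81\x95\xE3\x82\x80", "\xE3\x82\x89\xE3\x81\x84");

# decode() takes one scalar, so apply it to each element with map.
my @decoded_lines = map { decode('UTF-8', $_) } @encoded_lines;

# Encode again before printing to a handle that has no encoding layer.
my @round_trip = map { encode('UTF-8', $_) } @decoded_lines;

printf "%d bytes became %d characters\n",
    length($encoded_lines[0]), length($decoded_lines[0]);

# The UTF8 flag describes internal storage, not validity: upgrading a string
# turns the flag on without changing the string's value.
my $plain    = "caf\xE9";
my $upgraded = $plain;
utf8::upgrade($upgraded);

printf "same value: %s, flags: %s vs %s\n",
    $plain eq $upgraded ? 'yes' : 'no',
    utf8::is_utf8($plain)    ? 'on' : 'off',
    utf8::is_utf8($upgraded) ? 'on' : 'off';
```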
Re: UTF8 issues by Jim (Curate) on Nov 15, 2010 at 02:21 UTC
Because you admitted unabashedly you're new to Perl, and because you pleaded very graciously for our help, I've taken the liberty of refactoring your script a tiny bit. I've incorporated idioms of Modern Perl and a few Perl best practices gleaned from Perl Best Practices.

#!perl
#
# hiragana2romaji.pl - Convert hiragana text to romaji text

use strict;
use warnings;
use autodie;

use Lingua::JA::Moji qw( kana2romaji );

@ARGV == 2 or die "Usage: perl $0 <hiragana file> <romaji file>\n";

my $hiragana_file = shift @ARGV;
my $romaji_file = shift @ARGV;

open my $hiragana_fh, '<:encoding(UTF-8)', $hiragana_file;
open my $romaji_fh, '>:encoding(UTF-8)', $romaji_file;

while (my $hiragana_text = <$hiragana_fh>) {
    chomp $hiragana_text;
    my $romaji_text = kana2romaji($hiragana_text);
    print {$romaji_fh} "$romaji_text\n";
}

close $hiragana_fh;
close $romaji_fh;

exit 0;
Re: UTF8 issues by Jim (Curate) on Nov 15, 2010 at 03:57 UTC
I tested the Perl script on hiragana text in Unicode Normalization Form D as well as in Normalization Form C. It doesn't handle NFD properly. For this input…

NFC どらいどまんごす
NFD どらいどまんごす

…it generates this output…

NFC doraidomangosu
NFD to゙raito゙manko゙su

…which isn't right. The module Lingua::JA::Moji isn't accounting for the possibility of Japanese kana in their decomposed form (NFD). This may not matter to you for your application, but I felt it was worth pointing out to you in any case. If it does matter to you, then you might suggest to the author of Lingua::JA::Moji that he use the core module Unicode::Normalize to normalize the characters to NFC prior to converting them from kana to romaji.

Normalization Form C (NFC)

U+3069 HIRAGANA LETTER DO
U+3089 HIRAGANA LETTER RA
U+3044 HIRAGANA LETTER I
U+3069 HIRAGANA LETTER DO
U+307E HIRAGANA LETTER MA
U+3093 HIRAGANA LETTER N
U+3054 HIRAGANA LETTER GO
U+3059 HIRAGANA LETTER SU

Normalization Form D (NFD)

U+3068 HIRAGANA LETTER TO
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+3089 HIRAGANA LETTER RA
U+3044 HIRAGANA LETTER I
U+3068 HIRAGANA LETTER TO
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+307E HIRAGANA LETTER MA
U+3093 HIRAGANA LETTER N
U+3053 HIRAGANA LETTER KO
U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+3059 HIRAGANA LETTER SU
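The normalization step suggested above could look roughly like this. It is only a sketch: `normalize_kana` is a hypothetical helper, and the test word is built from the NFD code points listed in this reply. `NFC` is the function exported by the core module Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

# The decomposed (NFD) word, spelled out by code point so the source file's
# own encoding does not matter.
my $nfd_word = "\x{3068}\x{3099}\x{3089}\x{3044}\x{3068}\x{3099}"
             . "\x{307E}\x{3093}\x{3053}\x{3099}\x{3059}";

# Hypothetical helper: compose base kana and combining voiced sound marks
# into single code points before handing the text to a converter.
sub normalize_kana {
    my ($text) = @_;
    return NFC($text);
}

my $nfc_word = normalize_kana($nfd_word);

# TO + mark becomes DO, KO + mark becomes GO: 11 code points shrink to 8.
printf "%d code points before, %d after\n",
    length($nfd_word), length($nfc_word);
```

In a script that reads decoded lines, the same call would go just before the conversion, for example `kana2romaji(NFC($hiragana_text))`, on the assumption that the module is only given composed kana.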
