Pomax has asked for the wisdom of the Perl Monks concerning the following question:

I have a large UTF-8 encoded file with a mix of English and Japanese, where the Japanese uses a particular format, "baseword(reading)", to indicate how a word should be pronounced.

For instance: 漢字(かんじ) has "漢字" as baseword, and "かんじ" ('kanji') as reading.

I wanted to write a regular expression to convert this format to LaTeX instead, so that it ends up looking like "\ruby{baseword}{reading}". Using s/([^\w\s]+?)\(([^\w\s]+?)\)/\\ruby{\1}{\2}/g on a per-line basis works well, but runs into a problem with words that have multiple reading elements, such as:

繰(く)り返(かえ)し

This gets converted into \ruby{繰}{く}\ruby{り返}{かえ}し, which is correct behaviour as far as the regexp is concerned, but not the desired result given the text; the correct result should be:

\ruby{繰}{く}り\ruby{返}{かえ}し

as the main word 繰り返し is a single word, with readings for the two kanji 繰 and 返, but no readings for り and し, which are already written in phonetic script.

Since readings should only be added for characters that are in the Unicode "CJK Unified Ideographs" block (U+4E00-U+9FFF), and readings are always written with characters from the Unicode "Hiragana" block (U+3040-U+309F), I thought I'd use the following regexp instead:

s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{\1}{\2}/g

However, this doesn't seem to do anything at all: nothing gets matched. A program like "reggy" claims this should work, but actually running it through perl 5.10 doesn't convert anything =(

The program I'm using:

use utf8;
binmode STDIN,  ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
open(READ, "test.txt");
@lines = <READ>;
close(READ);
print "Found ".@lines." lines.\n";
open(WRITE, ">test-result.txt");
foreach (@lines) {
    $_ =~ s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{\1}{\2}/g;
    print WRITE $_;
}
close(WRITE);

An example bit of data file:

This is English text with 日本語(にほんご) mixed in. To test multi-furi text: 繰(く)り返(かえ)し should lead to \ruby{繰}{く}り\ruby{返}{かえ}し.
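
For what it's worth, one way to check whether the lines are even being decoded into characters (rather than read in as raw bytes) would be something like the snippet below; it is purely illustrative and the handle names are made up. If the lines come in as bytes, a class like [\x{4E00}-\x{9FFF}] can never match anything, since no single byte has an ordinal above 255.

# read the same file twice: once as raw bytes, once decoded as UTF-8
open(my $raw,     '<',                 'test.txt') or die $!;
open(my $decoded, '<:encoding(UTF-8)', 'test.txt') or die $!;
my $byte_line = <$raw>;
my $char_line = <$decoded>;
# if the two counts differ, the undecoded line is being treated as bytes,
# and codepoints above \x{FF} will never match in it
print length($byte_line), " vs ", length($char_line), "\n";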

Am I missing something terribly obvious? If so, what am I missing to make this work properly?

Re: matching unicode blocks with regular expressions
by moritz (Cardinal) on Jan 08, 2009 at 21:16 UTC
    Instead of testing for ranges of codepoints, it's usually safer (and much more readable) to test for the Script property. perlunicode lists the available scripts; judging from your description, I think you need this:
    s/(\p{Han}+?)\((\p{Hiragana}+?)\)/\\ruby{\1}{\2}/g;

    Running your code with this regex yields

    This is English text with \ruby{日本語}{にほんご} mixed in. To test multi-furi text: \ruby{繰}{く}り\ruby{返}{かえ}し

    Which I hope is correct.

    Update: the output above was produced with perl-5.8.8 on Linux, and can be reproduced with perl-5.10.0. I used the script below (the code tags of perlmonks will kill the example input, though):

    use utf8;
    binmode DATA,   ':encoding(UTF-8)';
    binmode STDOUT, ':encoding(UTF-8)';
    while (<DATA>) {
        $_ =~ s/(\p{Han}+?)\((\p{Hiragana}+?)\)/\\ruby{\1}{\2}/g;
        print;
    }
    __DATA__
    This is English text with 日本語(にほんご) mixed in. To test multi-furi text: 繰(く)り返(かえ)し
    # should lead to \ruby{繰}{く}り\ruby{返}{かえ}し.

      That's very curious... using your suggestion still yields no result on my machine, using ActiveState's ActivePerl 5.10.0 for x86 Windows. I guess I'll see if installing Strawberry Perl makes a difference =/

      update

      Made no difference, it still won't play nice =(

      2nd update

      It actually does work, but the .pl file itself had not been saved in UTF-8 format, which matters because use utf8; tells perl that the source code itself is UTF-8 encoded. Hurray for annoying little 'last thing you think of' problems.

      Thanks for the help, moritz!

      3rd update

      Actually, it doesn't work. The inline example you gave works fine (relying on __DATA__), but moving the data to a file called "test.txt", saved in UTF-8 encoding, and then running the code on that file instead fails again.

      use utf8;
      open(READ, "test.txt");
      @lines = <READ>;
      close(READ);
      foreach (@lines) {
          $_ =~ s/(\p{Han}+?)\((\p{Hiragana}+?)\)/\\ruby{\1}{\2}/g;
          print;
      }
      # force file to save as unicode: 日本語

      text file:

      This is English text with 日本語(にほんご) mixed in. To test multi-furi text: 繰(く)り返(かえ)し should lead to \ruby{繰}{く}り\ruby{返}{かえ}し.

      resulting text:

      This is English text with 日本語(にほんご) mixed in. To test multi-furi text: 繰(く)り返(かえ)し should lead to \ruby{繰}{く}り\ruby{返}{かえ}し.

      ... any ideas? =/

        I may be missing the point here (I'm not a Windows user either), but in the code snippet you showed in your "3rd update", I didn't see any evidence that the file handle was getting a utf8 IO layer (or, alternatively, that the data being read in was getting decoded into perl-internal utf8).

        Did you try doing the open like this?

        open( READ, "<:utf8", "test.txt" );
        You should also set the utf8 IO layer on your output file handle:
        binmode STDOUT, ":utf8";
        I gather that your "resulting text" should have been different from the input text, but wasn't. That is what you would expect if your regex substitution uses Unicode classes but perl doesn't know that the strings are perl-internal utf8 Unicode.
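
        Putting those two pieces together with moritz's \p{Han}/\p{Hiragana} regex, a minimal sketch of the whole script might look like this (untested here; the file names are the ones from your earlier posts, and :encoding(UTF-8) is just the stricter-checking spelling of :utf8):

        use utf8;
        # explicit UTF-8 layers on both handles, so the regex sees decoded
        # characters instead of raw bytes
        open( READ,  "<:encoding(UTF-8)", "test.txt" )        or die $!;
        open( WRITE, ">:encoding(UTF-8)", "test-result.txt" ) or die $!;
        while (<READ>) {
            s/(\p{Han}+?)\((\p{Hiragana}+?)\)/\\ruby{$1}{$2}/g;
            print WRITE $_;
        }
        close(READ);
        close(WRITE);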