
I have a large UTF-8 encoded file with a mix of English and Japanese, where the Japanese uses a particular format, "baseword(reading)", to indicate how a word should be pronounced.

For instance: 漢字(かんじ) has "漢字" as baseword, and "かんじ" ('kanji') as reading.

I wanted to write a regular expression to convert this format to LaTeX format instead, so that it would end up looking like "\ruby{baseword}{reading}". Using s/([^\w\s]+?)\(([^\w\s]+?)\)/\\ruby{\1}{\2}/g on a per-line basis works well, but runs into a problem when dealing with words that have multiple reading elements, such as:

繰(く)り返(かえ)し

This would get converted into \ruby{繰}{く}\ruby{り返}{かえ}し, which is the correct behaviour as far as the regexp is concerned, but not the desired result given the text. The correct result should be:

\ruby{繰}{く}り\ruby{返}{かえ}し

as the main word 繰り返し is one word, with readings for the two complex characters 繰 and 返, but no readings for り and し, which are already written in phonetic script.

Since readings should only be added for characters that are in the Unicode "CJK Unified Ideographs" block (U+4E00 to U+9FFF), and readings are always written with characters from the Unicode "Hiragana" block (U+3040 to U+309F), I thought I'd use the following regexp instead:

s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{\1}{\2}/g

However, this doesn't seem to do anything at all: nothing gets matched. While a program like "reggy" claims this should work, actually running it through Perl 5.10 doesn't convert anything =(

The program I'm using:

use utf8;
binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
open(READ, "test.txt");
@lines = <READ>;
close(READ);
print "Found ".@lines." lines.\n";
open(WRITE, ">test-result.txt");
foreach (@lines)
{
    $_ =~ s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{\1}{\2}/g;
    print WRITE $_;
}
close(WRITE);

An example bit of data file:

This is English text with 日本語(にほんご) mixed in. To test multi-furi text: 繰(く)り返(かえ)し should lead to \ruby{繰}{く}り\ruby{返}{かえ}し.
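As a sanity check, here is a minimal sketch that applies the same block-based pattern to the sample words as in-memory strings rather than lines read from a file. It assumes the script itself is saved as UTF-8 so that `use utf8` turns the literals into decoded characters; `to_ruby` is a hypothetical helper name used only for this illustration, and `$1`/`$2` stand in for `\1`/`\2` in the replacement. It only isolates the pattern from the file handling and is not meant as a fix for the program above.

```perl
use strict;
use warnings;
use utf8;                              # literals below are decoded characters (assumes a UTF-8 source file)
binmode STDOUT, ':encoding(UTF-8)';    # encode characters when printing

# Hypothetical helper: wrap kanji(hiragana) pairs in \ruby{...}{...}.
sub to_ruby {
    my ($text) = @_;
    # Base word: one or more CJK unified ideographs (U+4E00 to U+9FFF).
    # Reading:   one or more hiragana (U+3040 to U+309F) inside ASCII parentheses.
    $text =~ s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{$1}{$2}/g;
    return $text;
}

print to_ruby("繰(く)り返(かえ)し"), "\n";
print to_ruby("日本語(にほんご) mixed in"), "\n";
```

On strings that are already decoded like this, the hiragana り between the two kanji is left outside the first argument, because it does not fall in the ideograph range.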

Am I missing something terribly obvious? If so, what am I missing to make this work properly?

In reply to matching unicode blocks with regular expressions by Pomax
