Pomax has asked for the wisdom of the Perl Monks concerning the following question:
I have a large UTF-8 encoded file with a mix of English and Japanese, where the Japanese uses the format "baseword(reading)" to indicate how a word should be pronounced.
For instance: 漢字(かんじ) has "漢字" as baseword, and "かんじ" ('kanji') as reading.
I wanted to write a regular expression to convert this format to LaTeX instead, so that it ends up looking like "\ruby{baseword}{reading}". Using s/([^\w\s]+?)\(([^\w\s]+?)\)/\\ruby{\1}{\2}/g on a per-line basis works well, but runs into a problem with words that have multiple reading elements, such as:
繰(く)り返(かえ)し
This gets converted into \ruby{繰}{く}\ruby{り返}{かえ}し, which is correct behaviour as far as the regexp goes, but not the desired result given the text. The correct result should be:
\ruby{繰}{く}り\ruby{返}{かえ}し
as the main word 繰り返し is a single word, with readings for the two kanji 繰 and 返, but none for り and し, which are already written in phonetic script.
Since readings should only be added for characters in the Unicode "CJK Unified Ideographs" block (U+4E00 to U+9FFF), and readings are always written with characters from the Unicode "Hiragana" block (U+3040 to U+309F), I thought I'd use the following regexp instead:
s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{\1}{\2}/g
However, this doesn't seem to do anything at all: nothing gets matched. While a program like "reggy" claims this should work, actually running it through Perl 5.10 doesn't convert anything =(
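For comparison, here is a minimal sketch of the same substitution applied to a string that is already decoded into characters (here simply a literal under `use utf8`). The helper names add_ruby and add_ruby_props are made up for illustration. The second variant swaps the explicit ranges for Perl's block properties \p{InCJKUnifiedIdeographs} and \p{InHiragana}, which is an alternative spelling and not something from the post. Both write $1 and $2 in the replacement, since \1 there draws a warning under `use warnings`.

```perl
use strict;
use warnings;
use utf8;    # the string literals in this file are UTF-8

# Hypothetical helper: explicit code point ranges, as in the post.
sub add_ruby {
    my ($line) = @_;
    $line =~ s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{$1}{$2}/g;
    return $line;
}

# Hypothetical helper: the same idea using Unicode block properties.
sub add_ruby_props {
    my ($line) = @_;
    $line =~ s/(\p{InCJKUnifiedIdeographs}+?)\((\p{InHiragana}+?)\)/\\ruby{$1}{$2}/g;
    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';
print add_ruby("繰(く)り返(かえ)し"), "\n";
# prints: \ruby{繰}{く}り\ruby{返}{かえ}し
```

If the pattern matches in a sketch like this but not in the real program, that would suggest the pattern itself is fine and the input is not arriving as character data.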
The program I'm using:
use utf8;
binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
open(READ, "test.txt");
@lines = <READ>;
close(READ);
print "Found ".@lines." lines.\n";
open(WRITE, ">test-result.txt");
foreach (@lines) {
    $_ =~ s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{\1}{\2}/g;
    print WRITE $_;
}
close(WRITE);
An example bit of the data file:
This is English text with 日本語(にほんご) mixed in. To test multi-furi text: 繰(く)り返(かえ)し should lead to \ruby{繰}{く}り\ruby{返}{かえ}し.
Am I missing something terribly obvious? If so, what am I missing to make this work properly?
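One possible explanation, offered as a guess and not a confirmed diagnosis: the binmode calls apply to STDIN and STDOUT, while the data passes through the READ and WRITE handles, which have no encoding layer. The lines would then be raw UTF-8 bytes, and a byte can never equal a code point like \x{4E00}. A minimal sketch under that assumption, with a made-up helper name convert_file and lexical filehandles in place of READ and WRITE:

```perl
use strict;
use warnings;

# Hypothetical helper: read $in_path as UTF-8, add \ruby{...}{...} markup,
# write $out_path as UTF-8, and return the number of lines processed.
sub convert_file {
    my ($in_path, $out_path) = @_;

    # The encoding layer goes on the handles that actually carry the data.
    open(my $in, '<:encoding(UTF-8)', $in_path)
        or die "Cannot read $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path)
        or die "Cannot write $out_path: $!";

    my $count = 0;
    while (my $line = <$in>) {
        # $line is now a character string, so code point ranges can match.
        $line =~ s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{$1}{$2}/g;
        print {$out} $line;
        $count++;
    }

    close($in);
    close($out) or die "Cannot finish writing $out_path: $!";
    return $count;
}

# Only runs when two file names are given on the command line.
if (@ARGV == 2) {
    my $n = convert_file(@ARGV);
    print "Found $n lines.\n";
}
```

Saved under a name of your choosing and called as `perl yourscript.pl test.txt test-result.txt`, this should behave like the original program, apart from the decoding step.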
Re: matching unicode blocks with regular expressions
by moritz (Cardinal) on Jan 08, 2009 at 21:16 UTC
by Pomax (Initiate) on Jan 08, 2009 at 21:37 UTC
by graff (Chancellor) on Jan 09, 2009 at 03:17 UTC
by zentara (Cardinal) on Jan 08, 2009 at 21:48 UTC
by Pomax (Initiate) on Jan 08, 2009 at 21:59 UTC