I have a large UTF-8 encoded file with a mix of English and Japanese, where the Japanese uses the format "baseword(reading)" to indicate how a word should be pronounced.

For instance: 漢字(かんじ) has "漢字" as baseword, and "かんじ" ('kanji') as reading.

I wanted to write a regular expression to convert this format to LaTeX instead, so that it ends up looking like "\ruby{baseword}{reading}". Using s/([^\w\s]+?)\(([^\w\s]+?)\)/\\ruby{\1}{\2}/g on a per-line basis works well, but runs into a problem with words that have multiple reading elements, such as:

繰(く)り返(かえ)し

This gets converted into \ruby{繰}{く}\ruby{り返}{かえ}し, which is correct behaviour as far as the regexp is concerned, but not the desired result for this text. The correct result should be:

\ruby{繰}{く}り\ruby{返}{かえ}し

as the main word 繰り返し is a single word, with readings for the two kanji 繰 and 返, but no readings for り and し, which are already written in phonetic script (hiragana).

Since readings should only be added for characters in the Unicode "CJK Unified Ideographs" block (U+4E00 to U+9FFF), and readings are always written with characters from the Unicode "Hiragana" block (U+3040 to U+309F), I thought I'd use the following regexp instead:

s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{\1}{\2}/g

However, this doesn't seem to do anything at all: nothing gets matched. While a program like "reggy" claims this should work, actually running it through Perl 5.10 doesn't convert anything =(
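As a sanity check, the pattern can be tried on a string that is certain to hold decoded characters rather than raw bytes. The sketch below assumes the script itself is saved as UTF-8 (so that "use utf8" applies to the literal), and the helper name to_ruby is hypothetical. It uses $1 and $2 in the replacement instead of \1 and \2, which Perl prefers on the right-hand side of a substitution.

```perl
use strict;
use warnings;
use utf8;                            # the literals in this source file are UTF-8
binmode STDOUT, ':encoding(UTF-8)';  # encode characters when printing

# Hypothetical helper: applies the block-based substitution to one string.
sub to_ruby {
    my ($text) = @_;
    # kanji run (CJK Unified Ideographs) followed by a hiragana reading in parentheses
    $text =~ s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{$1}{$2}/g;
    return $text;
}

print to_ruby("繰(く)り返(かえ)し"), "\n";   # prints \ruby{繰}{く}り\ruby{返}{かえ}し
```

If this prints the desired result, the pattern itself is probably fine and the difference lies in how the text reaches the substitution.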

The program I'm using:

use utf8;
binmode STDIN, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

open(READ, "test.txt");
@lines = <READ>;
close(READ);
print "Found ".@lines." lines.\n";

open(WRITE, ">test-result.txt");
foreach (@lines) {
    $_ =~ s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{\1}{\2}/g;
    print WRITE $_;
}
close(WRITE);

A sample of the data file:

This is English text with 日本語(にほんご) mixed in. To test multi-furi text: 繰(く)り返(かえ)し should lead to \ruby{繰}{く}り\ruby{返}{かえ}し.

Am I missing something terribly obvious? If so, what am I missing to make this work properly?
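One possible explanation, offered as an assumption rather than a confirmed diagnosis: binmode on STDIN and STDOUT does not touch the READ and WRITE handles, so the lines read from test.txt may arrive as undecoded bytes, and a class such as [\x{4E00}-\x{9FFF}] can never match a single byte. A minimal sketch that opens both files with an explicit :encoding(UTF-8) layer might look like this; the helper name convert_ruby_file is hypothetical, and $1/$2 replace \1/\2 in the replacement.

```perl
use strict;
use warnings;
use utf8;   # only affects literals in this source file, not file I/O

# Hypothetical helper: rewrite base(reading) pairs from $in_path into $out_path.
sub convert_ruby_file {
    my ($in_path, $out_path) = @_;

    # Decode bytes to characters on read, encode characters to bytes on write.
    open(my $in,  '<:encoding(UTF-8)', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path) or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        # $line now holds characters, so the code point ranges can match.
        $line =~ s/([\x{4E00}-\x{9FFF}]+?)\(([\x{3040}-\x{309F}]+?)\)/\\ruby{$1}{$2}/g;
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}
```

Calling convert_ruby_file("test.txt", "test-result.txt") would then mirror the original program, with the decoding applied to the handles that actually carry the data.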


In reply to matching unicode blocks with regular expressions by Pomax
