in reply to matching unicode blocks with regular expressions
s/(\p{Han}+?)\((\p{Hiragana}+?)\)/\\ruby{\1}{\2}/g;
Running your code with this regex yields:
This is English text with \ruby{日本語}{にほんご} mixed in. To test multi-furi text: \ruby{繰}{く}り\ruby{返}{かえ}し
I hope that is the correct output.
Update: the output above was produced with perl-5.8.8 on Linux and can be reproduced with perl-5.10.0. I used the script below (the PerlMonks code tags will mangle the non-ASCII example input, though):
use utf8;
binmode DATA, ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';
while (<DATA>) {
    $_ =~ s/(\p{Han}+?)\((\p{Hiragana}+?)\)/\\ruby{\1}{\2}/g;
    print;
}
__DATA__
This is English text with 日本語(にほんご) mixed in. To test multi-furi text: 繰(く)り返(かえ)し
# should lead to \ruby{繰}{く}り\ruby{返}{かえ}し.
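Since non-ASCII input tends not to survive being pasted into a post, here is a minimal sketch of the same substitution with the test sentence spelled out as \x{...} escapes, so the source stays pure ASCII. The variable names $text and $ruby are mine, and I wrote $1/$2 instead of \1/\2 in the replacement; that is a stylistic assumption, not a change to the regex itself.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Same test sentence as in the script above, written with code point
# escapes instead of literal kanji and kana.
my $text = "This is English text with "
         . "\x{65E5}\x{672C}\x{8A9E}(\x{306B}\x{307B}\x{3093}\x{3054}) mixed in. "
         . "To test multi-furi text: "
         . "\x{7E70}(\x{304F})\x{308A}\x{8FD4}(\x{304B}\x{3048})\x{3057}\n";

# A run of Han characters followed by a parenthesised hiragana reading
# is rewritten as \ruby{kanji}{reading}. The copy-then-substitute idiom
# leaves $text untouched.
(my $ruby = $text) =~ s/(\p{Han}+?)\((\p{Hiragana}+?)\)/\\ruby{$1}{$2}/g;

print $ruby;
```

Because the string is built from escapes, neither "use utf8" nor an input layer is needed here; only STDOUT has to be told to encode as UTF-8. It should print the same line as the output shown earlier.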