I am trying to clean up a corpus of UTF-8 texts that contain mainly Russian Cyrillic by removing all the Latin text in them. Of course, I need to keep the punctuation, spaces, etc.
Unicode support seems to be working, because when I use: print s/[\P{InCyrillic}]//g; I really do get only Cyrillic strings piled up against each other, with no punctuation, no spaces and no Latin letters. So far, so good.
But when I try to create my own character definition in a subroutine, things stop working as expected.
I only get digits without spaces, which makes no sense to me.

#! usr/local/perl
use utf8;

sub InRussian {
return <<'END';
+utf8::Cyrillic
+utf8::Mark
+utf8::Number
+utf8::Punctuation
END
}

print s/[\P{InRussian}]//g;

Even when my subroutine contains only +utf8::Cyrillic, the result is digits, which is not the same output as print s/[\P{InCyrillic}]//g; gives.
I'm calling the perl script from bash with -CS -p. I'm using perl 5.8.6 on Mac OS X. Any help would be greatly appreciated.
Mike