psychomachine has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody, a linguist in pain over here.

I am trying to clean up a corpus of utf8 texts which contain mainly Russian Cyrillic by removing all the Latin text that's in them. Of course, I need to keep the punctuation, spaces etc.

Unicode seems to be doing fine, because when I use: print s/[\P{InCyrillic}]//g; I really get only Cyrillic strings piled up against each other, with no punctuation, no spaces and no Latin letters. So far, so good.

But when I try to create my own character definition in a subroutine, things stop working as expected.

#! usr/local/perl
use utf8;

sub InRussian{
    return <<'END';
+utf8::Cyrillic
+utf8::Mark
+utf8::Number
+utf8::Punctuation
END
}

print s/[\P{InRussian}]//g;
I only get digits without spaces, which makes no sense to me.

Even when my subroutine contains only +utf8::Cyrillic, the result is digits, which is not the same as the output of print s/[\P{InCyrillic}]//g;

I'm calling the perl script from bash with -CS -p. I'm using perl 5.8.6 on Mac OS X. Any help would be greatly appreciated.

Mike

Re: problem with user-defined unicode character properties
by BrowserUk (Patriarch) on Jun 11, 2007 at 14:19 UTC

      Unfortunately, that doesn't solve the problem, although it may bring us a step closer to the solution.

      Without the trailing spaces, I've discovered that the last property line in the subroutine does take effect as expected. For instance:

      #! usr/local/perl
      use utf8;

      sub InRussian{
          return <<'END';
      +utf8::Cyrillic
      +utf8::Punctuation
      END
      }

      print s/[\P{InRussian}]//g;
      gets me only numbers and punctuation, whereas putting +utf8::Cyrillic after +utf8::Punctuation in the subroutine produces the same output as applying the InCyrillic pattern directly: print s/[\P{InCyrillic}]//g;

      Does this make any sense?

        Um...two guesses.

        1. You are using the negation \P{}, and NOT(+A +B) may not mean what you intend. E.g. 'not in A and not in B'?

          Maybe you need (something like):

          sub NotInRussian{
              return <<'END';
          !utf8::Cyrillic
          !utf8::Punctuation
          END
          }
          ...
          s/\p{NotInRussian}//g
        2. It might have something to do with this from the POD?
          A final note on the user-defined property tests and mappings: they will be used only if the scalar has been marked as having Unicode characters. Old byte-style strings will not be affected.

          Does your editor produce Unicode source files? Will Perl promote ASCII source to Unicode?
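To make the POD note above concrete, here is a minimal sketch of the difference between a byte string and a string marked as Unicode. It uses only Encode::decode and utf8::is_utf8; the sample bytes are made up for illustration. With the -CS switch the standard streams should already be decoded, so this mostly matters when the text arrives some other way.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 encoded bytes for three Cyrillic letters plus " hi",
# written as escapes so this source file stays plain ASCII.
my $bytes = "\xD0\x9F\xD1\x80\xD0\xB8 hi";

# decode() turns the raw bytes into a character string flagged as Unicode.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d, flagged %s\n",
    length($bytes), utf8::is_utf8($bytes) ? 'yes' : 'no';
printf "chars: length %d, flagged %s\n",
    length($chars), utf8::is_utf8($chars) ? 'yes' : 'no';
```

The byte string counts each Cyrillic letter as two bytes, while the decoded string counts it as one character; only the decoded one is the kind of scalar the POD passage is talking about.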


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: problem with user-defined unicode character properties
by PreferredUserName (Pilgrim) on Jun 11, 2007 at 18:56 UTC
    The "-p" switch already implies printing the line, so you don't need to do it explicitly.

    print() outputs its arguments, which in your case is whatever s///g returns: the number of substitutions it made. Those are the numbers you're seeing.

    Change the last line to:

    s/[\P{InRussian}]//g;
    and it should work.
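As a rough illustration of the point about print and s///, here is a standalone sketch that does not rely on -p. The helper strip_non_russian and the sample string are hypothetical, and the extra "0020" line (a plain space) is an assumption added so that spaces survive; it is not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the sample string below is a UTF-8 literal
binmode STDOUT, ':encoding(UTF-8)';

# User-defined property in the style used in the thread. The "0020" line
# (plain space) is an assumption added here; no line may have trailing spaces.
sub InRussian {
    return <<'END';
+utf8::Cyrillic
+utf8::Punctuation
0020
END
}

# Hypothetical helper: returns the cleaned text and the substitution count.
sub strip_non_russian {
    my ($text) = @_;
    # s///g returns the number of substitutions made (false if none).
    my $count = ($text =~ s/\P{InRussian}//g) || 0;
    return ($text, $count);
}

my ($clean, $count) = strip_non_russian("Привет, world! Это тест.");
print "substitutions: $count\n";   # the kind of number "print s///" was emitting
print "$clean\n";                  # the text itself, which -p would print for you
```

Keeping the count and the text in separate variables makes it easier to see which of the two ends up on the output.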