in reply to split text into words -- Unicode problem (I guess)

&#1111; is not a valid character in Perl; you need a \x{...} escape with the hexadecimal code point (decimal 1111 is \x{457}). Also, look at the Unicode character properties in perlunicode; you will probably find the \p{L} class useful in regexes:
$_ = "a\x{2625}\x{10000}";
print map { sprintf "%x\n", ord } m/(\p{L})/g;

Output:

61
10000
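Applied to the original question (getting the words without the trailing punctuation), the same property can be used to pull out runs of letters instead of splitting on spaces. This is only a minimal sketch: the helper name words_of and the sample sentence are made up for illustration, and it assumes the string has already been decoded into characters.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of Unicode letters in a decoded string.
sub words_of {
    my ($text) = @_;
    return $text =~ /(\p{L}+)/g;
}

# Sample text built with \x{...} escapes so the source file stays plain ASCII.
# \x{21B} is "t with comma below"; the sentence itself is invented.
my $text_block = "The monarchy. Japan. Which company? Consta\x{21B}a!";
my @words = words_of($text_block);

binmode STDOUT, ':encoding(UTF-8)';    # encode wide characters on output
print "$_\n" for @words;
```

Note that apostrophes and hyphens are not letters, so a word like "don't" would come back as two pieces; whether that matters depends on the text.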

Re^2: split text into words -- Unicode problem (I guess)
by bogdan77 (Initiate) on Mar 29, 2007 at 14:49 UTC
    @andye Um... no. Simply using space as delimiter would give me "monarchy.", "Japan.", "company?", and so on, not just the words themselves. @dk The text in the variable textBlock doesn't contain "ї" constructs -- it contains the real characters (for example, ț is "T with comma below"). The html encoding changed those characters into "ї" constructs when I submitted them.
      Argh... let me try again... @dk The text in the variable textBlock doesn't contain &#(number); constructs -- it contains the real characters (for example, &#539; is "T with comma below"). The HTML encoding changed those characters into &#(number); constructs when I submitted them.
        It works with "use encoding utf8;". Thanks a lot, guys :^)
        If your script itself is written in UTF-8, you need "use utf8;" to tell Perl about it. See the documentation for the utf8 pragma for more.
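As a rough illustration of what the pragma changes, here is a small sketch. It assumes the file is saved as UTF-8; the helper name letter_count and the sample word are made up for this example.

```perl
use strict;
use warnings;
use utf8;                              # this source file is saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';    # encode characters again on output

# Hypothetical helper: count the Unicode letters in a decoded string.
sub letter_count {
    my ($text) = @_;
    my @letters = $text =~ /\p{L}/g;
    return scalar @letters;
}

my $word = "țară";    # literal non-ASCII text in the source
printf "%s: length %d, %d letters\n", $word, length($word), letter_count($word);
```

With the pragma, length counts characters in the literal; without it, the same literal is a string of UTF-8 bytes and length counts those instead. Note that "use utf8;" only affects literals in the source code. Text read from a file or from STDIN still has to be decoded separately, for example with an :encoding(UTF-8) layer on the input handle.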
        I used "use utf8;", and still no joy... :-( It seems there's some kind of overlapping, even if the delimiters are in the proper order...