in reply to Re: split text into words -- Unicode problem (I guess)
in thread split text into words -- Unicode problem (I guess)

@andye Um... no. Simply using space as delimiter would give me "monarchy.", "Japan.", "company?", and so on, not just the words themselves. @dk The text in the variable textBlock doesn't contain "&#(number);" constructs -- it contains the real characters (for example, ț is "T with comma below"). The HTML encoding changed those characters into "&#(number);" constructs when I submitted them.
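To make the first point concrete, here is a minimal sketch comparing a plain whitespace split with a match on runs of letters. The sample sentence and the names $text, @by_space and @words are made up for illustration; this is not the original textBlock.

```perl
use strict;
use warnings;

# Hypothetical sample text, not the original textBlock.
my $text = "Is Japan a monarchy? Ask the company.";

# Splitting on whitespace keeps the punctuation attached to the words.
my @by_space = split ' ', $text;

# Matching runs of letters returns only the words themselves.
my @words = $text =~ /(\p{L}+)/g;

print join('|', @by_space), "\n";
print join('|', @words), "\n";
```

The first list still contains "monarchy?" and "company.", while the second contains only the bare words.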

Re^3: split text into words -- Unicode problem (I guess)
by bogdan77 (Initiate) on Mar 29, 2007 at 14:55 UTC
    Argh... let me try again... @dk The text in the variable textBlock doesn't contain &#(numbers); constructs -- it contains the real characters (for example, &#five3hree9ine; is "T with comma below"). The html encoding changed those characters into &#(number); constructs when I submitted them.
      It works with "use encoding utf8;". Thanks a lot, guys :^)
      If your script is written in UTF-8, "use utf8;" is needed to tell Perl about it. See utf8 for more.
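      A minimal sketch of what that looks like, assuming the script file itself is saved as UTF-8. The sample string and variable names are invented for illustration, and the binmode line is only there so the output prints cleanly.

```perl
use strict;
use warnings;
use utf8;                             # the source code itself is UTF-8
binmode(STDOUT, ':encoding(UTF-8)');  # encode characters on output

# Hypothetical sample; ț is U+021B, "t with comma below".
my $sample = "soț și soție, frați.";

# With "use utf8" the literal is decoded into characters, so \p{L}
# sees ț as one letter rather than as two separate bytes.
my @words = $sample =~ /(\p{L}+)/g;

print scalar(@words), " words: ", join('|', @words), "\n";
```

      Note that "use utf8;" only covers literals in the source file. Text read from a file or STDIN has to be decoded separately, for example with an ":encoding(UTF-8)" layer on the filehandle.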
      I used "use utf8;", and still no joy... :-( It seems there's some kind of overlapping, even if the delimiters are in proper order...