http://qs1969.pair.com?node_id=738869


in reply to Re^2: Modern best practices for multilingual regexp alphabetical character matching?
in thread Modern best practices for multilingual regexp alphabetical character matching?

Everything "looks" fine until you try to extract substrings in some way. That's because, without decoding your data on input, the strings are handled as sequences of bytes, so a character like ä is represented by two bytes in UTF-8.

Now if you extract part of a string that you didn't decode first, you can accidentally rip these two bytes apart, leaving encoding garbage behind, which is usually not a good idea.
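A minimal sketch of that failure, assuming the input arrives as raw UTF-8 bytes; the sample word and the variable names are illustrative only:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for "März", as read from a handle with no decoding layer.
# The "ä" occupies two bytes: 0xC3 0xA4.
my $bytes = "M\xC3\xA4rz";

# On undecoded data, substr counts bytes: this keeps "M" plus only the
# first half of "ä", which is not valid UTF-8 on its own.
my $broken = substr($bytes, 0, 2);

# After decoding, the same call counts characters and keeps "ä" whole.
my $chars = decode('UTF-8', $bytes);
my $ok    = substr($chars, 0, 2);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
```

Re-encoding $ok with encode('UTF-8', $ok) gives well-formed output again, whereas $broken ends in a dangling lead byte.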

So I recommend properly decoding UTF-8 (and other character encodings) on input and encoding the strings on output. Also add use utf8; if you have non-ASCII string constants in your source code.
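One way to follow that advice is with I/O layers, sketched below. The in-memory input is a hypothetical stand-in for a real UTF-8 file, and the script itself is assumed to be saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;   # literals such as 'ä' in this file are read as characters

# Stand-in for a UTF-8 encoded input file (hypothetical data); a real
# script would open a path here instead of a scalar reference.
my $raw = "M\xC3\xA4rz\n";

# Decode on input: the :encoding layer turns bytes into characters.
open my $in, '<:encoding(UTF-8)', \$raw or die "Cannot open input: $!";
my $word = <$in>;
close $in;
chomp $word;

# Encode on output: characters are turned back into UTF-8 bytes.
binmode STDOUT, ':encoding(UTF-8)';

# With decoded data, Unicode properties and literals behave as expected.
my $is_alpha   = $word =~ /\A\p{L}+\z/     ? 1 : 0;
my $has_umlaut = index($word, 'ä') >= 0    ? 1 : 0;

print "$word has ", length($word), " characters\n";
```

The layers keep the decoding and encoding at the edges of the program, so everything in between works on characters.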

Re^4: Modern best practices for multilingual regexp alphabetical character matching?
by mea (Initiate) on Jan 26, 2009 at 09:59 UTC

    Thanks for the answer.

    So basically I am safe as long as I use these two lines and put "use utf8;" at the top of my script every time. This has really been the most confusing thing so far. The otherwise excellent "Learning Perl" doesn't mention these problems at all, and some of its examples don't work correctly with UTF-8 characters. That is fine for English-speaking beginners, but people working with other languages have to deal with this issue right from the start. It could have saved me a lot of time if it had simply said "for non-English languages or UTF-8, add this to your script". Well, at least now I know and can go back to learning the "proper" stuff... Thanks again,

    Best Regards,

    Martin