Unicode and regexes

hotshot has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys!

I'm checking the overhead of supporting Unicode in my Perl project. As far as I have seen so far, without using any Unicode module (utf8), Perl just "gives what she gets". For example, when I use opendir to list the directories under a given directory, and some of them have Korean or German names (in UTF-8), Perl receives the names and displays them properly.

The problem starts when I try to manipulate a directory name with a regular expression. Does this mean I'll have to change all my regexps (and there are endless regexps) to support Unicode, for example by using IsAlnum and '_' in place of \w? Will the regexps become much more complicated (longer), and lose some of the power of the old ones?

Hotshot

Edited: ~Wed Oct 30 16:38:08 2002 (GMT) by footpad: Retitled (was Unicode), added <P> tags, and fixed minor spelling errors - per Consideration

Re: Unicode and regexes by dakkar (Hermit) on Oct 30, 2002 at 17:51 UTC
The regexps, per se, don't need any change (I'm assuming Perl 5.8.0, since 5.6.x had some problems). You need to ensure two things: (1) that your strings are correctly encoded, and (2) that Perl knows it. The first is a problem in itself, but a bit off-topic. The second can be done in two ways. If the strings come from a filehandle, you can use something like `open(FH, "<:utf8", "file")` to tell Perl to treat the data as UTF-8 (or use the `:encoding` layer, see `perldoc -f open`). Otherwise (as in your example, where they come from a dirhandle), `use Encode;` and `$string=Encode::decode("utf-8",$string);`
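To illustrate the dirhandle case, here is a minimal sketch, assuming Perl 5.8 or later and a filesystem whose names are stored as UTF-8. The helper name `decoded_entries` and the sample name "Bücher" are made up for the example; only `Encode::decode` comes from the advice above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list a directory and decode each name from UTF-8.
# Assumes the filesystem really stores names as UTF-8.
sub decoded_entries {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @names = map  { decode('UTF-8', $_) }
                grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    return @names;
}

# The bytes readdir would hand back for a directory named "Bücher":
# the "ü" arrives as the two bytes 0xC3 0xBC.
my $raw = "B\xC3\xBCcher";

# Undecoded, Perl sees seven separate bytes, and the unchanged
# regexp fails to match the name as a single word.
my $bytes_len   = length $raw;
my $raw_matches = ($raw =~ /^\w+$/) ? 1 : 0;

# Decoded, Perl sees six characters, and the same regexp matches.
my $name       = decode('UTF-8', $raw);
my $chars_len  = length $name;
my $whole_word = ($name =~ /^\w+$/) ? 1 : 0;

# Encode on output so the decoded name is written back as UTF-8 bytes.
binmode(STDOUT, ':encoding(UTF-8)');
print "$name: $bytes_len bytes, $chars_len characters\n";
```

The regexp `/^\w+$/` is the same in both tests; only the decoding step changes the result. In a real script you would call something like `decoded_entries('.')` once and run the existing regexps on the returned names.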
Re: Re: Unicode and regexes by hotshot (Prior) on Oct 31, 2002 at 07:54 UTC
And what if I'm still using Perl 5.6.1?

Hotshot