Re: Regular Expressions on Unicode

When I write while (<>){...} with the intention of using the script on utf8 data that comes from either named files in @ARGV or redirected/piped input via STDIN, I normally include these lines near the top:

use open IN => ':utf8';
binmode STDIN, ':utf8';

The first line makes sure that all files in @ARGV get opened with the intended encoding layer, and the second line covers STDIN. (I also typically append , OUT => ':utf8' to the first line, and add a third line, binmode STDOUT, ':utf8'; to cover STDOUT.)
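Putting the input and output pieces together, the top of such a script might look like the sketch below. The loop body is only a placeholder of my own (it prints lines that contain at least one letter), not something from the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Default layers for files opened in this lexical scope,
# including the files named in @ARGV that <> opens for us.
use open IN => ':utf8', OUT => ':utf8';

# The standard handles are already open, so set their layers explicitly.
binmode STDIN,  ':utf8';
binmode STDOUT, ':utf8';

while (<>) {
    # Placeholder body: $_ now holds characters, not bytes,
    # so Unicode properties such as \p{L} behave as expected.
    print if /\p{L}/;
}
```

It can then be run either way: with file names on the command line, or with data piped or redirected into STDIN.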

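The difference between ":encoding(utf8)" and just plain ":utf8" is, I think, simply a matter of how much you want to trust your input. Plain ":utf8" does no validation at all: it just marks whatever bytes arrive as utf8 characters, so encoding errors (sequences of non-ASCII bytes that do not form valid utf8 characters) pass through unnoticed and can surface later as "Malformed UTF-8 character" warnings or errors, far from the place where the data was read. In contrast, ":encoding(utf8)" decodes the input through the Encode module: on bad input it gives a warning message, supplies a replacement string (a \xHH escape) that makes the problem easy to spot, and keeps running. If you want the strictest checking, use ":encoding(UTF-8)", with the hyphen.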
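A small way to see the ":encoding(utf8)" behaviour for yourself is to read a deliberately broken string through an in-memory filehandle. This is just a sketch: the sample data and the variable names are made up for illustration, and the exact wording of the warning may vary between Perl versions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# "caf" followed by a lone 0xE9 byte (Latin-1 e-acute), which is not valid utf8.
my $bytes = "caf\xE9\n";

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Read the bytes through the decoding layer via an in-memory filehandle.
open my $fh, '<:encoding(utf8)', \$bytes
    or die "cannot open in-memory handle: $!";
my $line = <$fh>;
close $fh;

# The bad byte should show up as a visible escape in $line,
# and the decoder should have complained about it.
binmode STDOUT, ':utf8';
print "line read: $line";
print "warning:   $_" for @warnings;
```

Swapping the layer for plain ':utf8' in the open call is a quick way to compare what the two layers do with the same bad input.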

(updated code snippet to normalize quotes)
