Re: Regexp and Linux (is it utf issue?)

When you read text files, you should decode them. This is easy with PerlIO layers, the Encode module, and the three-argument form of open:

use Encode;
open my $fh, "<:encoding(whatever)", $filename or die $!;

This way, Perl decodes everything automatically, and you only have to work with characters, not bytes.

When you write text to files, printing characters above 0xFF to a filehandle that has no encoding layer produces the famous warning "Wide character in print" (or whichever function did the writing). You need to encode on output using the same technique: open my $write, ">:encoding(whatever)", $filename or die $!; You can also use the :utf8 layer to encode characters, because they are internally stored as valid UTF-8.
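As a minimal sketch of both directions together, the following writes a string through an encoding layer and reads it back through the same layer. The sample string, the temporary file, and the choice of UTF-8 as the encoding are assumptions made for illustration; substitute whatever encoding your files really use.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: "naive" with a diaeresis plus a smiley (7 characters).
my $text = "na\x{EF}ve \x{263A}";

# Temporary file used only for this demonstration.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;

# Encode on output: the layer turns characters into UTF-8 bytes,
# so no "Wide character" warning is raised.
open my $write, ">:encoding(UTF-8)", $filename
    or die "Cannot open $filename for writing: $!";
print {$write} $text;
close $write or die "Cannot close $filename: $!";

# Decode on input: the layer turns the bytes back into characters.
open my $read, "<:encoding(UTF-8)", $filename
    or die "Cannot open $filename for reading: $!";
my $round_trip = do { local $/; <$read> };
close $read;

my $chars = length $round_trip;   # counted in characters
my $bytes = -s $filename;         # counted in bytes on disk

print "characters: $chars, bytes on disk: $bytes\n";
```

The character count and the byte count differ because the two non-ASCII characters take more than one byte each in UTF-8, which is exactly the distinction the layers hide from the rest of the program.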

Do not use the :utf8 I/O layer to decode text: it simply sets the "character" flag on strings read from the filehandle without any checks, which is generally unsafe. Use :encoding(UTF-8) for input instead, since it validates what it reads. See: UTF8 related proof of concept exploit released at T-DOSE.
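To illustrate what "checks" means here, this sketch calls Encode's strict UTF-8 decoder directly rather than going through a filehandle. The helper name decode_strict and the sample byte strings are hypothetical, and the use of FB_CROAK is my choice for making the failure visible; an :encoding(UTF-8) layer reports malformed input according to its own fallback settings rather than dying.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical inputs: "cafe" with an acute accent, once as valid UTF-8
# and once as Latin-1, where the lone 0xE9 byte is not valid UTF-8.
my $good_bytes = "caf\xC3\xA9";
my $bad_bytes  = "caf\xE9";

# Hypothetical helper: returns the decoded characters, or undef if the
# bytes are not well-formed UTF-8.
sub decode_strict {
    my ($bytes) = @_;   # work on a copy; decode() with a CHECK value may modify its argument
    return scalar eval { decode('UTF-8', $bytes, FB_CROAK) };
}

my $good = decode_strict($good_bytes);
my $bad  = decode_strict($bad_bytes);

print "valid input:     ", (defined $good ? "decoded" : "rejected"), "\n";
print "malformed input: ", (defined $bad  ? "decoded" : "rejected"), "\n";
```

A decoder that rejects the second string is what you want between untrusted input and code that assumes well-formed character strings.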
