Re: length() miscounting UTF8 characters?

How do you call the script? It seems you are feeding it the data on STDIN, which `use open IO` does not affect. The following works for me (in both 5.16.2 and 5.10.1):

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

binmode STDOUT, ':encoding(UTF-8)';  # encode output as UTF-8
binmode DATA,   ':encoding(UTF-8)';  # decode the DATA section into characters
while (<DATA>) {
    chomp;
    s/[A-Za-z]//g;
    say $_, ' ', length;
}

__DATA__
æ
æð
æða
æðaber
æðahnútur
æðakölkun
æðardúnn
æðarfugl
æðarkolla
æðarkóngur
æðarvarp
æði
æðimargur
æðisgenginn
æðiskast
æðislegur
æðrast
æðri
æðrulaus
æðruleysi
æðruorð
æðrutónn
æðstur
æður
æfa
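
To see why length() can look wrong when the input was never decoded, here is a minimal sketch. It assumes the data arrives as raw UTF-8 bytes; `char_length` is a helper name made up for this illustration, and decoding through the core Encode module is just one way to get from bytes to characters.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Encode qw{ decode };

# Hypothetical helper: decode raw UTF-8 bytes, then count characters.
sub char_length {
    my ($bytes) = @_;
    return length(decode('UTF-8', $bytes));
}

# "\xC3\xA6\xC3\xB0" is the UTF-8 byte sequence for the two characters "æð".
my $raw = "\xC3\xA6\xC3\xB0";

print length($raw), "\n";       # prints 4: the undecoded string is counted byte by byte
print char_length($raw), "\n";  # prints 2: after decoding, length counts characters
```

An I/O layer on the filehandle does the same decoding as the data is read, which is what the binmode calls in the script above are for.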

لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

Re^2: length() miscounting UTF8 characters? by AppleFritter (Vicar) on Apr 27, 2014 at 22:42 UTC
Yes, I'm piping the text file into the script, though that's more for convenience than anything else. It'd be easy enough to change.

I read up on the open pragma again and noticed that it can be fed another subpragma, `:std`, to affect the STD* streams:

"The :std subpragma on its own has no effect, but if combined with the :utf8 or :encoding subpragmas, it converts the standard filehandles (STDIN, STDOUT, STDERR) to comply with encoding selected for input/output handles. For example, if both input and out are chosen to be :encoding(utf8), a :std will mean that STDIN, STDOUT, and STDERR are also in :encoding(utf8)."

So I tried changing that line to `use open IO => ':std', ':utf8';` but that didn't make a difference either. I'm probably still missing something fairly obvious. Thanks for your help, by the way!
Re^3: length() miscounting UTF8 characters? by choroba (Cardinal) on Apr 27, 2014 at 22:51 UTC
You are almost there. `use open IO => ':utf8', ':std';` The order matters.
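For completeness, a minimal sketch of the whole script reading from STDIN with that line in place. My understanding, which I have not checked against every Perl version, is that the argument directly after `IO =>` is taken as the layer, so `:std` is only recognised as the subpragma when it follows as a separate argument.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

# The layer goes directly after IO =>; :std follows as its own argument
# and extends that layer to STDIN, STDOUT and STDERR.
use open IO => ':utf8', ':std';

while (<STDIN>) {
    chomp;
    say $_, ' ', length;  # length now counts decoded characters
}
```

Saved under a made-up name such as count.pl, it would be run as `perl count.pl < words.txt`, where words.txt stands for whatever UTF-8 text file you are piping in.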
Re^4: length() miscounting UTF8 characters? by AppleFritter (Vicar) on Apr 27, 2014 at 22:59 UTC
Wonderful! That really works - hardly Perl at its dwimmiest, but I'll take what I can get. Thanks so much again!
Re^5: length() miscounting UTF8 characters? by choroba (Cardinal) on Apr 27, 2014 at 23:06 UTC