Well, it's clear my understanding of decoding and encoding was completely the wrong way around. I really wish I'd looked at your second paragraph more closely before spending the entire night trying to bypass the encoding completely and set the utf8 flag manually. My assumption was that if I know I have a text file with UTF-8 encoding, and perl uses UTF-8 internally when told to, it should be as simple as doing a sysread into a scalar and flagging it as utf8. So I created the file test_utf8, set its contents to "this has wide utf8 chars like ❇ (snowflake)", and tested with the following SSCCE.
use utf8;
sub fileread; # use do { local $/; <$fh> }
my $file = 'test_utf8';
# Test 1
binmode STDOUT, ':encoding(UTF-8)';
my $line = fileread $file,':raw';
utf8::decode($line);
if ($line =~ /(❇)/) { print "found '$1'\n"; }
print $line;
sub fileread {
my ($file,$enc) = @_;
my $string; my $stref = \$string;
open(my $fh, "< $enc", $file) || die "Can't open $file: $!";
${$stref} = do { local $/; <$fh> };
return $string;
}
This prints
found '❇'
this has wide utf8 chars like ❇ (snowflake)
which is the desired behaviour, but other tests produce more puzzling results. For example, using fileread $file,':encoding(UTF-8)'; or fileread $file,':encoding(ISO-8859-1)' produced identical results, but the following test
my $line = fileread $file,':encoding(UTF-8)';
$line = Encode::decode('UTF-8', $line, 'Encode::FB_CROAK');
was (I'm sure) producing "this has wide utf8 chars like ‡ (snowflake)" a few hours ago, but is now crashing the script with "Undefined subroutine &Encode::decode called at - line 18." if binmode is commented out and "Wide character at - line 18." if it isn't. Maybe it was utf8::encode that gave me the first line; things are getting kinda hazy at this point. It does produce the correct result when used with fileread $file,':raw' or fileread $file,':encoding(ISO-8859-1)'. Interestingly, unicode_strings made no difference to the regex succeeding or failing in any of my tests, and utf8::upgrade/downgrade don't appear to do anything at all in this SSCCE. It would be nice to conclude that when in doubt I should just use utf8::decode, but I've also been testing with Net::Async::FastCGI, which also gives me a tied STDOUT, only it does UTF-8 encoding on it, which I need to turn off with set_encoding( undef ); if I do that.
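For what it's worth, here is a minimal sketch of what I now think is going on; treat it as my reading rather than the definitive explanation. It assumes an in-memory file standing in for test_utf8, and slurp_with is a hypothetical helper of mine, not part of the SSCCE above. As far as I can tell, utf8::decode first tries to downgrade the string back to bytes, so it turns the ISO-8859-1 read back into the original octets before decoding, and it leaves an already-decoded string alone. That would explain why all three layers looked identical. The sketch also loads Encode explicitly and passes the Encode::FB_CROAK constant rather than the quoted string 'Encode::FB_CROAK'.

```perl
use strict;
use warnings;
use utf8;        # this source file contains a literal ❇
use Encode ();   # load Encode explicitly before calling Encode::decode

# Stand-in for the test file: the same kind of text as raw UTF-8 bytes.
my $bytes = Encode::encode('UTF-8', "wide char ❇\n");

# Hypothetical helper: slurp an in-memory file through the given layer.
sub slurp_with {
    my ($layer) = @_;
    open(my $fh, "<$layer", \$bytes) or die "Can't open in-memory file: $!";
    my $text = do { local $/; <$fh> };
    close $fh;
    return $text;
}

# Read through each layer, then apply utf8::decode as in Test 1.
my %result;
for my $layer (':raw', ':encoding(ISO-8859-1)', ':encoding(UTF-8)') {
    my $text = slurp_with($layer);
    utf8::decode($text);   # a no-op if $text already holds wide characters
    $result{$layer} = $text;
}

# Decoding an already-decoded string with Encode::decode is an error.
my $decoded = $result{':encoding(UTF-8)'};
my $outcome = eval {
    Encode::decode('UTF-8', $decoded, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
} ? 'decoded' : 'croaked';

binmode STDOUT, ':encoding(UTF-8)';
for my $layer (sort keys %result) {
    printf "%-24s %s", $layer, $result{$layer};
}
print "second decode: $outcome\n";
```

If that reading is right, the rule of thumb is to decode exactly once: either let the :encoding(UTF-8) layer do it on open, or read with :raw and decode by hand, but not both.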
PS: I notice all the occurrences of ❇ in my code blocks have been turned into &#10055; so it's some small comfort that perlmonks.org can't quite get a grip on this either. 😜