First, create a one-byte "binary" file to test with:
C:\>perl -e "print qq(\xB5)" > data.bin
Then run this script:
use strict;
use warnings;
use feature 'say';
use Encode qw/ _utf8_off _utf8_on is_utf8 /;
use utf8;
use Devel::Peek;
my $s1 = ' '; # a space (anything)
_utf8_on( $s1 ); # or assign not-ascii above, instead
my $s2 = $s1;
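open my $fh, '<', 'data.bin' or die "Cannot open data.bin: $!";
binmode $fh;
sysread $fh, $s1, 1; # read one byte into the UTF8-flagged scalar
Dump $s1;
seek $fh, 0, 0;
$s2 = do { local $/; <$fh> }; # slurp the same file with readline
Dump $s2;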
Output ($s1 first, then $s2):
SV = PVMG(0xc149ec) at 0xc20dec
REFCNT = 1
FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0xc15a1c "\302\265"\0 [UTF8 "\x{b5}"]
CUR = 2
LEN = 10
MAGIC = 0xc13ffc
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = -1
SV = PV(0x3f9f6c) at 0xc20f0c
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0xc2e6a4 "\265"\0
CUR = 1
LEN = 10
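After sysread, $s1 still carries the UTF8 flag and holds the two bytes "\302\265", while slurping the same file into $s2 yields the single raw byte "\265" with no flag. I'm not sure whether this is a bug or not.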
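To help read the two dumps, here is a minimal sketch that involves no file I/O at all. It assumes only core Perl; the variable names are mine. It shows that a string holding the character 0xB5 can be stored either as one byte or as two UTF-8 bytes, and that pure-Perl code sees the two as equal. Only code that looks at the raw buffer without checking the flag (XS code, for example) would notice a difference.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling the pragma

my $raw = "\xB5";    # stored internally as the single byte 0xB5
my $up  = "\xB5";
utf8::upgrade($up);  # same character, now stored as the two bytes C2 B5

printf "eq=%s raw_bytes=%d up_bytes=%d\n",
    ( $raw eq $up ? 'yes' : 'no' ),
    bytes::length($raw),
    bytes::length($up);
# prints "eq=yes raw_bytes=1 up_bytes=2"
```

Both strings have a length() of 1 in characters; only the internal byte count and the UTF8 flag differ.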
The sysread documentation says: "Note that if the filehandle has been marked as :utf8, Unicode characters are read instead of bytes (the LENGTH, OFFSET, and the return value of sysread are in Unicode characters)."
Does this imply that if FH has not been marked, OFFSET is treated as bytes? If so, could the existing UTF-8 content of the scalar become invalid?
I think that if OFFSET is 0, the utf8-ness of the resulting string should match the file's IO encoding layer, i.e. sysread should produce the same result as the slurp above, regardless of what the scalar held before. And if OFFSET is not zero, then what? Perhaps this should be documented more clearly, including which combinations should never be used.
BTW, this appears to be what this bug is about. Tk passes the file name as utf8, and that parameter is then (rather recklessly) re-used (!) to receive the file content.
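Until that is settled, one defensive pattern is to never let sysread write into a scalar that already holds something else. The sketch below is my own illustration, not something taken from the documentation: read_bytes is a hypothetical helper, and the temporary file stands in for data.bin. It reads into a fresh lexical, so the result cannot inherit a UTF8 flag from a reused variable.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read up to $len bytes into a brand-new scalar,
# so the result never inherits a UTF8 flag from a reused variable.
sub read_bytes {
    my ( $fh, $len ) = @_;
    my $buf = '';    # fresh, unflagged buffer
    my $got = sysread( $fh, $buf, $len );
    die "sysread failed: $!" unless defined $got;
    return $buf;
}

# Recreate the one-byte test file (0xB5) in a temporary location.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode $tmp;
print {$tmp} "\xB5";
close $tmp or die "close failed: $!";

open my $fh, '<:raw', $path or die "Cannot open $path: $!";
my $bytes = read_bytes( $fh, 1 );
close $fh;

printf "length=%d flagged=%s\n",
    length($bytes), ( utf8::is_utf8($bytes) ? 'yes' : 'no' );
# prints "length=1 flagged=no"
```

If the target scalar cannot be replaced, calling utf8::downgrade on it after the read is another option, as long as every character in it fits in a single byte.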