PerlMonks  

comment on


A "binary" file for us:

C:\>perl -e "print qq(\xB5)" > data.bin

And:

use strict;
use warnings;
use feature 'say';
use Encode qw/ _utf8_off _utf8_on is_utf8 /;
use utf8;
use Devel::Peek;

my $s1 = ' ';       # a space (anything)
_utf8_on( $s1 );    # or assign not-ascii above, instead
my $s2 = $s1;

open my $fh, '<', 'data.bin';
binmode $fh;

sysread $fh, $s1, 1;
Dump $s1;

seek $fh, 0, 0;
$s2 = do { local $/; <$fh> };
Dump $s2;
SV = PVMG(0xc149ec) at 0xc20dec
  REFCNT = 1
  FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0xc15a1c "\302\265"\0 [UTF8 "\x{b5}"]
  CUR = 2
  LEN = 10
  MAGIC = 0xc13ffc
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = -1
SV = PV(0x3f9f6c) at 0xc20f0c
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0xc2e6a4 "\265"\0
  CUR = 1
  LEN = 10

Not sure if it's a bug or not.

The sysread documentation says: "Note that if the filehandle has been marked as :utf8, Unicode characters are read instead of bytes (the LENGTH, OFFSET, and the return value of sysread are in Unicode characters)."

Does this imply that, if the filehandle has not been marked, OFFSET is treated as bytes? Then, possibly, the utf8 content of the scalar becomes invalid?

I think that if OFFSET is 0, the utf8-ness of the string should match the file's IO encoding layer, i.e. sysread should produce the same result as slurping, above, regardless of the content of the original scalar. And if OFFSET is not zero, then what? Perhaps this should be documented more clearly, including the combinations that should never be used.
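
Whether or not it counts as a bug, a minimal sketch of how one might check the result by value and then force the single-byte form, under a few assumptions that are mine and not part of the test above: the one-byte file is recreated with File::Temp, utf8::upgrade stands in for _utf8_on (the two should be equivalent for an ASCII-only string), and the variable names $same_char, $downgraded and $is_flagged are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw/ tempfile /;

# Recreate the one-byte test file (0xB5) in a temporary location.
my ( $out, $path ) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "\xB5";
close $out or die "close: $!";

# Assumption: utf8::upgrade replaces _utf8_on for this ASCII-only string.
my $s1 = ' ';
utf8::upgrade($s1);

open my $fh, '<', $path or die "open: $!";
binmode $fh;
my $n = sysread $fh, $s1, 1;
defined $n or die "sysread: $!";
close $fh;

# Compare by value (length and code point), not by internal representation.
my $same_char = ( length($s1) == 1 && ord($s1) == 0xB5 ) ? 1 : 0;

# Ask for the single-byte representation; with a true second argument,
# utf8::downgrade returns false instead of dying if that is impossible.
my $downgraded = utf8::downgrade( $s1, 1 ) ? 1 : 0;
my $is_flagged = utf8::is_utf8($s1) ? 1 : 0;

printf "same_char=%d downgraded=%d flagged=%d\n",
    $same_char, $downgraded, $is_flagged;
```

If the value check passes, the flagged scalar and the slurped one differ only in how the character is stored. Code that needs raw bytes (for example, anything handing the buffer to C) could call utf8::downgrade right after the read, or simply read into a fresh scalar rather than a re-used one.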

BTW, it looks like it's about this bug. Tk passes the file name as utf8, and this parameter is (rather recklessly) re-used (!) to receive the file content.


In reply to Read (sysread) binary data into utf8 string by vr
