UTF-8 and systemIO are not friends anymore

by Tommy (Chaplain)
on Dec 30, 2015 at 23:07 UTC (#1151547=perlquestion)

Tommy has asked for the wisdom of the Perl Monks concerning the following question:

As of Perl 5.23, using :utf8 binmode on filehandles opened with sysopen causes a warning (or at least it does as soon as someone tries to syswrite/sysread).

From what I read in perldoc perlport -- it's a portability issue because sysseek'ing and tell'ing get garbled and confused. Makes perfect sense why you'd want to avoid it.
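A minimal sketch for checking how a particular perl reacts to this combination. The helper name `probe_utf8_syswrite` and its three outcome labels are invented for illustration. It assumes the commonly documented behavior: a deprecation warning in the 5.24 through 5.28 releases, a fatal error from 5.30 onward, and silence on older perls.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_TRUNC);
use File::Temp qw(tempdir);

# Hypothetical helper: open a scratch file with sysopen, push the :utf8
# layer onto it, then try a syswrite and report what this perl did.
sub probe_utf8_syswrite {
    my $dir  = tempdir(CLEANUP => 1);
    my $path = "$dir/probe.txt";

    sysopen(my $fh, $path, O_WRONLY | O_CREAT | O_TRUNC)
        or die "sysopen failed: $!";
    binmode($fh, ':utf8') or die "binmode failed: $!";

    # Collect warnings instead of letting them reach STDERR.
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    # eval traps the exception thrown by perls where this is fatal.
    my $survived = eval { syswrite($fh, "snowman: \x{2603}\n"); 1 };
    close $fh;

    return 'fatal'      unless $survived;
    return 'deprecated' if @warnings;
    return 'silent';
}

my $outcome = probe_utf8_syswrite();
print "perl $] => $outcome\n";
```

Running it under each perl you need to support shows which of the three behaviors you have to plan for.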

I've written the following statement as part of a documentation effort, based on my understanding of the new deprecation, and I wonder if any of it is inaccurate or even plain less-than-ideal. I have asked the same question to the perl5 core hackers on IRC, but while I'm waiting on a response from them, I'm still interested in what you think about it.

Specifically because I'm not a C expert or perl-guts guru, I want to be doubly sure this is right before I start disseminating it:

As a rule of thumb, and to avoid confusion, stick to the use of utf8 binmode for unicode text streams only, or if you don't have a compelling need for that, you can use lower level system IO via sysopen/sysread/syswrite and be done with it. In the end, bytes at rest in files are just bytes. If you aren't trying to encode your text as UTF-8 strict, you can safely use system IO without adding any additional IO disciplines/layers and you should be fine. With system IO, no translations will occur on your input/output streams that could cause your file content to be mangled. You can write raw text or even binary streams and never worry.

UPDATE: If you're reading this on the front page, please click through: the discussion on this post is a worthwhile read and contains corrections.

Tommy
A mistake can be valuable or costly, depending on how faithfully you pursue correction

Replies are listed 'Best First'.
Re: UTF-8 and systemIO are not friends anymore (unfortunate)
by tye (Sage) on Dec 31, 2015 at 00:25 UTC

    I don't see any problems with your presented statement.

    I was rather shocked when I stumbled across the new sysread behavior for UTF-8 streams. It seems rather nonsensical to me, demonstrating a lack of understanding or appreciation of several of the reasons why Perl even has sysread().

    One not-yet-published module of mine notes:

    At least in some versions of Perl, if a file handle has the ":utf8" encoding layer in place [such as via binmode()], then sysread() does some weird and unfortunate things under the covers beyond just calling the read(2) system call once. This makes 4-argument select() incompatible with sysread() for those versions of Perl (when the ":utf8" layer is used).

    [....]

    You should probably avoid the ":utf8" layer for file handles you wish to use with 4-argument select().
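One way to follow that advice is to keep the handle byte-oriented and decode by hand. The sketch below is an illustration, not code from the module mentioned above. The helper name `slurp_utf8_via_select`, the two-byte chunk size, and the one-second timeout are all arbitrary choices, and it assumes a POSIX-like system where `select` works on pipes. It relies on the `Encode::FB_QUIET` behavior described in the Encode documentation: `decode` returns what it could process and leaves an incomplete trailing sequence in the buffer.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: drain a raw handle with 4-argument select() and
# sysread(), decoding UTF-8 manually so no :utf8 layer is needed.
sub slurp_utf8_via_select {
    my ($fh, $chunk_size) = @_;
    my ($bytes, $text) = ('', '');

    my $rin = '';
    vec($rin, fileno($fh), 1) = 1;

    while (1) {
        # Wait up to one second for the handle to become readable.
        my $nfound = select(my $rout = $rin, undef, undef, 1);
        last if $nfound <= 0;    # timeout or error: give up in this sketch

        # Append new bytes after any undecoded leftovers.
        my $got = sysread($fh, $bytes, $chunk_size, length $bytes);
        die "sysread failed: $!" unless defined $got;
        last if $got == 0;       # end of file

        # FB_QUIET decodes as far as it can and leaves the unprocessed
        # remainder (e.g. half of a multi-byte character) in $bytes.
        $text .= decode('UTF-8', $bytes, Encode::FB_QUIET);
    }
    return $text;
}

pipe(my $reader, my $writer) or die "pipe failed: $!";
binmode($_) for $reader, $writer;    # plain bytes on both ends

my $original = "na\x{EF}ve \x{2603}";
defined syswrite($writer, encode('UTF-8', $original))
    or die "syswrite failed: $!";
close $writer;

# A tiny chunk size forces multi-byte characters to be split across reads.
my $roundtrip = slurp_utf8_via_select($reader, 2);
print $roundtrip eq $original ? "round trip ok\n" : "round trip FAILED\n";
```

Because each pass issues exactly one `sysread`, the readiness reported by `select` stays meaningful. Note that genuinely malformed input would simply pile up in the buffer here, so real code would need to detect that case.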

    - tye        

      It seems rather nonsensical to me
      Why? Perl's sysread:
      sysread FILEHANDLE,SCALAR,LENGTH,OFFSET
      (my system's) pread:
      ssize_t pread(int fd, void *buf, size_t count, off_t offset);
      How can it work with anything other than bytes and still call itself sys-something? (even if Perl's pread is actually lseek and read... that doesn't change much wrt utf-8)
        LOL! Shows how much I use sysread... But anyway, the argument about seek stands. But sorry! :)
Re: UTF-8 and systemIO are not friends anymore
by Anonymous Monk on Dec 31, 2015 at 01:07 UTC
    Well, deprecating the use of :utf8 with sysread and friends seems very reasonable to me. OTOH, I do see problems with your statement. In fact, it doesn't even make much sense to me (sorry).
    As a rule of thumb, and to avoid confusion, stick to the use of utf8 binmode for unicode text streams only
    All strings in the Perl programming language are unicode text strings, and all streams produce those, so what exactly do you mean by "unicode text streams"? You probably meant something like "use only read, readline, print on streams marked (using binmode, open etc.) with utf-8".
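A minimal sketch of that reading, assuming the stream is marked with `:encoding(UTF-8)` at open time and only touched through the buffered functions. The file name and sample text are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/layered.txt";
my $line = "na\x{EF}ve \x{2603}\n";    # character string, not bytes

# Write through the encoding layer with print (buffered PerlIO).
open(my $out, '>:encoding(UTF-8)', $path) or die "open for write failed: $!";
print {$out} $line;
close($out) or die "close failed: $!";

# Read back through the same layer with readline.
open(my $in, '<:encoding(UTF-8)', $path) or die "open for read failed: $!";
my $read_back = readline($in);
close($in);

# The layer did the encoding: the file holds more bytes than the
# string has characters.
my $size_on_disk = -s $path;
print "characters: ", length($line), ", bytes on disk: $size_on_disk\n";
```

The program never calls sysread or syswrite on the marked handle, which is the combination the deprecation targets.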
    if you don't have a compelling need for that, you can use lower level system IO via sysopen/sysread/syswrite and be done with it. In the end, bytes at rest in files are just bytes. If you aren't trying to encode your text as UTF-8 strict, you can safely use system IO without adding any additional IO disciplines/layers and you should be fine.
    Just delete that part, IMO. People who understand what using thin wrappers over (system) read, pread implies (wrt buffering, for example) do not need to read that, and people who don't understand don't need to use sysread and syswrite, but they might think it's advice to use those.
    With system IO, no translations will occur on your input/output streams that could cause your file content to be mangled. You can write raw text or even binary streams and never worry.
    That's confusing to me. "Mangled" in Perl pretty much always means "improper upgrading or downgrading". If the new changes cause the string to not be downgraded when using syswrite - well, that would be just wrong! but I'm sure that won't be the case - so "raw text" can still get mangled... just delete it, too, IMO.
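The downgrading point can be seen in a small sketch. It assumes a handle with no layers and a string whose characters all fit in one byte. The file name and sample text are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Fcntl qw(O_WRONLY O_CREAT O_TRUNC);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/raw.bin";

sysopen(my $fh, $path, O_WRONLY | O_CREAT | O_TRUNC)
    or die "sysopen failed: $!";
binmode($fh);    # plain bytes, no layers

my $text = "caf\x{E9}";    # 4 characters, the last is U+00E9
utf8::upgrade($text);      # force Perl's internal UTF-8 representation

# Written as-is: syswrite downgrades the string, so U+00E9 goes out
# as the single byte 0xE9 (Latin-1), not as UTF-8.
my $implicit = syswrite($fh, $text);
defined $implicit or die "syswrite failed: $!";

# Encoded explicitly first: U+00E9 goes out as the two bytes 0xC3 0xA9.
my $explicit = syswrite($fh, encode('UTF-8', $text));
defined $explicit or die "syswrite failed: $!";

close($fh) or die "close failed: $!";

# Read the file back as raw bytes to see what actually landed on disk.
open(my $in, '<:raw', $path) or die "open failed: $!";
my $bytes = do { local $/; <$in> };
close($in);

printf "implicit: %d bytes, explicit: %d bytes, file: %s\n",
    $implicit, $explicit,
    join ' ', map { sprintf '%02X', ord } split //, $bytes;
```

So system IO does not by itself decide the encoding of what lands on disk. Whether the file ends up as Latin-1 or UTF-8 depends on whether the program encoded the string before calling syswrite.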

    So, basically, I don't like any of it :) Yeah, sorry, but it's just written in a pretty confusing manner, and there definitely shouldn't be any recommendation or encouragement to use sys functions.

      there definitely shouldn't be any recommendation or encouragement to use sys functions.
      I think I should add why: people already get confused by strings in Perl, and the way you wrote it, they might think using sys funcs might be a solution to corrupted characters and other such things, which is not the case and Perl's documentation shouldn't imply that... it's already confusing enough, unfortunately.

        it's already confusing enough, unfortunately.

        ...Which is unfortunately what I'm trying to help avoid. I'm getting more confused about it myself! This, sadly, isn't helping the situation.

        My feeling is that IO should not try to do fancy things for people that they didn't ask for. Just an opinion, and obviously not shared by all. The core issue is that a lot of people don't know how to do portable IO on their own (not implying that my understanding of it is without flaw), and it takes a large investment of time to learn how to do it properly, and so we can end up with a DWIM attitude about it...

        And that leads to a DWIM implementation. In an effort to be portable, maybe, and certainly in an effort to make things easy on people who just want to reliably read and write files, PerlIO (as it's referred to by many), effectively layers on more activity than C 'system' open/read/write operations. In fact, the trend toward our current situation feels like it has been progressively more DWIM over the years. Again, some might disagree, and I do confess that I haven't read the full release notes of every Perl either. But this isn't the point...

        With multi-byte characters, the situation just snowballs into more confusion. I openly admit that I do not yet fully understand, but am working to better understand, the variations in behavior across Perl's many actively used versions on all major platforms, in addition to pre-baked Perl "distributions" (e.g., Strawberry Perl) and individual compilation options when compiled from source (e.g., via perlbrew). Presently I do not know how many permutations of behavior there are, or how to account for all of them when attempting to establish best practices for completely portable IO, if not for others, at least for me. At the very least, I want to know more and do better than I currently can for my own endeavors.

        So, to eliminate all unknowns, "I go back to the beginning." I go as low as Perl lets me go (system IO), and don't rely on anything that PerlIO might layer on top. If I do that, I don't have to peel back the layers of every onion and figure out what is happening and then compensate when necessary. Honestly how can anyone ensure that what is written is what was intended, when using PerlIO, every time, on every platform, in every Perl? We can't, and NOT exclusively because of Perl. We don't know, and can't know, specifically because we don't always know what people intended to write; we don't know the human element's every intention.

        So this is where I sit scratching my head. It's a balancing act of decoding both Perl behavior and human intention.

        Some, I'm sure, will say that I'm a fool for overthinking things, or that I'm worrying too much. But even if I have to open myself up to a lambasting by admitting that I don't completely know how to solve a problem whose nature remains something of a mystery to me, I will hazard the shame <mild sarcasm> if I'm able to gain greater understanding by having put my questions to the collectively much-more-intelligent community. Can anyone shed more light? Or is the answer to just go with PerlIO, "accept the defaults", trust in the decisions that are built into PerlIO, and let the chips fall where they may?

        Folks, I don't know how to eliminate all unknowns, other than to use system IO. When I do so, I do it under the assumption that I'm probably going to be able to safely write out all content, all the time, on all Perls, on all systems, without creating filename.mojibake when I wanted to create filename.txt -- so long as my output stream wasn't mojibake already. If input was garbage, the output to a file will be garbage, surely...but it won't be because of me.

        I do understand that if a stream is already mangled, that it's going to still be mangled when written via syswrite. I do expect that some will say that using PerlIO might have corrected the issue in some situations (maybe). But for now I am operating under the assumption that if I don't do anything behind the curtains (DWIM), I won't later have anything to explain. I can honestly say, hey, my software wrote out what you fed it.

        Tommy

      All strings in the Perl programming language are unicode text strings, and all streams produce those, so what exactly do you mean by "unicode text streams"? You probably meant something like "use only read, readline, print on streams marked (using binmode, open etc.) with utf-8".

      Apologies, yes, that's what I meant. Please also see my comments today in reply to Anonymous Monk

      Tommy
