http://qs1969.pair.com?node_id=1151714


in reply to Re^2: UTF-8 and systemIO are not friends anymore
in thread UTF-8 and systemIO are not friends anymore

it's already confusing enough, unfortunately.

...Which is unfortunately what I'm trying to help avoid. I'm getting more confused about it myself! This, sadly, isn't helping the situation.

My feeling is that IO should not try to do fancy things for people that they didn't ask for. Just an opinion, and obviously not shared by all. The core issue is that a lot of people don't know how to do portable IO on their own (not implying that my understanding of it is without flaw), and it takes a large investment of time to learn how to do it properly, and so we can end up with a DWIM attitude about it...

And that leads to a DWIM implementation. In an effort to be portable, maybe, and certainly in an effort to make things easy on people who just want to reliably read and write files, PerlIO (as many refer to it) effectively layers more activity on top of C 'system' open/read/write operations. In fact, the trend toward our current situation feels like it has been progressively more DWIM over the years. Again, some might disagree, and I do confess that I haven't read the full release notes of every Perl either. But this isn't the point...

With multi-byte characters, the situation just snowballs into more confusion. I openly admit that I do not yet know, but am working on obtaining a better understanding of, the variances in behavior of Perl's many actively-used versions on all major platforms, in addition to pre-baked Perl "distributions" (e.g., Strawberry Perl) and individual compilation options when compiled from source (e.g., via perlbrew). Presently I do not know how many permutations of behavior there are, or how to account for all of them when attempting to establish best practices for completely portable IO--if not for others, at least for me. At the very least, I want to know more and do better than I currently can for my own endeavors.

So, to eliminate all unknowns, "I go back to the beginning." I go as low as Perl lets me go (system IO), and don't rely on anything that PerlIO might layer on top. If I do that, I don't have to peel back the layers of every onion and figure out what is happening and then compensate when necessary. Honestly how can anyone ensure that what is written is what was intended, when using PerlIO, every time, on every platform, in every Perl? We can't, and NOT exclusively because of Perl. We don't know, and can't know, specifically because we don't always know what people intended to write; we don't know the human element's every intention.

So this is where I sit scratching my head. It's a balancing act of decoding both Perl behavior and human intention.

Some, I'm sure, will say that I'm a fool for overthinking things, or that I'm worrying too much. But even if I have to open myself up to a lambasting for admitting that I don't completely know how to solve a problem because, to me, its nature remains something of a mystery, I will hazard the shame <mild sarcasm> if I'm able to gain greater understanding by having posed my questions to the collectively much-more-intelligent community. Can anyone shed more light? Or is the answer to just go with PerlIO, "accept the defaults", trust in the decisions that are built into PerlIO, and let the chips fall where they may?

Folks, I don't know how to eliminate all unknowns, other than to use system IO. When I do so, I do it under the assumption that I'm probably going to be able to safely write out all content, all the time, on all Perls, on all systems, without creating filename.mojibake when I wanted to create filename.txt -- so long as my output stream wasn't mojibake already. If input was garbage, the output to a file will be garbage, surely...but it won't be because of me.

I do understand that if a stream is already mangled, it's still going to be mangled when written via syswrite. I do expect that some will say that using PerlIO might have corrected the issue in some situations (maybe). But for now I am operating under the assumption that if I don't do anything behind the curtains (DWIM), I won't later have anything to explain. I can honestly say, hey, my software wrote out what you fed it.

Tommy
A mistake can be valuable or costly, depending on how faithfully you pursue correction

Re^4: UTF-8 and systemIO are not friends anymore
by Anonymous Monk on Jan 03, 2016 at 18:22 UTC
    I go as low as Perl lets me go (system IO), and don't rely on anything that PerlIO might layer on top. If I do that, I don't have to peel back the layers of every onion and figure out what is happening and then compensate when necessary.
    I understand. But, in Perl, sys functions are just not suitable for that. Consider:
        use 5.022;
        use warnings;
        use Fcntl;
        use Devel::Peek;

        sysopen my $fh, 'out', O_WRONLY | O_CREAT;
        my $buffer = "\xFF\xFF\xFF";
        utf8::upgrade($buffer);    # same three characters, different internal bytes
        Dump $buffer;              # show the internal representation
        syswrite $fh, $buffer;
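
    To make the effect of that snippet visible without Devel::Peek, here is a minimal sketch that compares the character count, the size of the internal UTF-8 form, and the size of the file syswrite produces. The temporary directory and the variable names are my own assumptions, not part of the snippet above.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_TRUNC);
use File::Temp qw(tempdir);

# Hypothetical scratch location so the sketch does not touch the current directory.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/out";

my $buffer = "\xFF\xFF\xFF";
utf8::upgrade($buffer);    # still three characters, now held internally as UTF-8

# Byte length of the UTF-8 form, computed on a copy so $buffer is untouched.
my $utf8_copy = $buffer;
utf8::encode($utf8_copy);
my $internal_bytes = length $utf8_copy;

sysopen(my $fh, $path, O_WRONLY | O_CREAT | O_TRUNC) or die "sysopen failed: $!";
my $written = syswrite($fh, $buffer);
defined $written or die "syswrite failed: $!";
close($fh) or die "close failed: $!";

my $on_disk = -s $path;
printf "characters: %d, internal UTF-8 bytes: %d, bytes on disk: %d\n",
    length($buffer), $internal_bytes, $on_disk;
```

    As far as I can tell, this reports 3 characters, 6 internal bytes and 3 bytes on disk: syswrite does not dump the scalar's internal buffer, it converts the string back to one byte per character first. So even the "lowest" layer depends on Perl's string model.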
    Folks, I don't know how to eliminate all unknowns, other than to use system IO.
    Bypassing PerlIO is not enough; using system IO doesn't solve anything. And anyway, there are many ways for errors to appear - string concatenation, for instance... It doesn't really have much to do with IO per se.
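
    A small sketch of the concatenation problem, with no IO involved at all. The sample strings and variable names are made up for illustration; the point is only that joining undecoded bytes with decoded characters corrupts the result before anything is written.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "caf\xC3\xA9";                      # UTF-8 bytes for "café", never decoded
my $chars = decode('UTF-8', "\xE2\x98\xBA");    # one decoded character, U+263A

# Wrong: each byte of $bytes is treated as a separate Latin-1 character.
my $mixed     = $bytes . $chars;
my $mixed_out = encode('UTF-8', $mixed);

# Right: decode first, so both operands are character strings.
my $clean     = decode('UTF-8', $bytes) . $chars;
my $clean_out = encode('UTF-8', $clean);

printf "mixed: %s\n", join ' ', map { sprintf '%02X', ord } split //, $mixed_out;
printf "clean: %s\n", join ' ', map { sprintf '%02X', ord } split //, $clean_out;
```

    The "mixed" output is two bytes longer than the "clean" one because the é has been encoded twice. Whether the result later goes through print or syswrite makes no difference.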
    Can anyone shed more light? Or is the answer to just go with PerlIO, "accept the defaults", trust in the decisions that are built into PerlIO, and let the chips fall where they may?
    It seems to me that p5porters are strongly opposed to any explanations about how text in Perl actually works, which is too bad. Perl's "unified" model of text is a particularly leaky abstraction, IMO. Basically, p5porters advise "decode all inputs, encode all outputs" (using Encode, for example, or open my $fh, '<:encoding(SOME_ENCODING)', or binmode, or some such). Of course, in practice some strings cannot be decoded, or sometimes you don't want to decode/encode anything but some module that you use does that for you anyway (those are relatively recent examples actually posted on PerlMonks).
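
    For what it's worth, "decode all inputs, encode all outputs" can be sketched like this, using encoding layers on open. The file names, the sample word and the choice of UTF-8 are assumptions for the sake of the example; real input may of course be in another encoding, or not decodable at all.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($in_path, $out_path) = ("$dir/in.txt", "$dir/out.txt");

# Create a sample input file holding the UTF-8 bytes for "naïve" plus a newline.
open(my $raw, '>:raw', $in_path) or die "open $in_path: $!";
print {$raw} "na\xC3\xAFve\n";
close($raw) or die "close $in_path: $!";

# Decode on input: the layer turns bytes into characters.
open(my $in, '<:encoding(UTF-8)', $in_path) or die "open $in_path: $!";
my $line = <$in>;
close($in) or die "close $in_path: $!";
chomp $line;

my $char_count = length $line;       # counts characters, not bytes
my $reversed   = reverse $line;      # character-wise, so the ï stays intact

# Encode on output: the layer turns characters back into bytes.
open(my $out, '>:encoding(UTF-8)', $out_path) or die "open $out_path: $!";
print {$out} $reversed;
close($out) or die "close $out_path: $!";

printf "%d characters in, %d bytes out\n", $char_count, -s $out_path;
```

    Between the two open calls the program only ever sees characters, which is why length and reverse behave sensibly. Doing the same reverse on the undecoded bytes would split the two-byte ï and produce invalid UTF-8.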

    I'm not sure why they're opposed to documenting it... it's not like it's something difficult to understand. Why don't you ask them about it? If they're not actually against it, and just don't have time, someone else (maybe even I) can do it (especially if you'll then fix my grammatical, orthographic and other mistakes).

      I find that most insightful, and helpful. What I'm seeing here basically is that it's quite easy to 'break' strings. I haven't ever really tried, but I can see that if someone were to deliberately (or inadvertently) work such chicanery with text, it would probably break the 'protections' I've hoped to achieve with syswrite().

      With regard to the idea of your possible assistance in documenting these specific Perl behaviors (once understood/clarified), you probably don't need to attach significant stipulations about linguistic mistakes when you use words like "orthographic" in reference to your own literary introspection ;-) I.e., you good, bro.

      It would be a worthwhile undertaking, because I can't hope to successfully design working software implementations over undocumented behavior. I have some somewhat-pressing software deployments in the pipeline that are going to hinge on a firm understanding of these undocumented things.

      Tommy