graff has asked for the wisdom of the Perl Monks concerning the following question:

I tend to write a lot of filtering scripts that read either from STDIN or from one or more files named in @ARGV -- I just start them with a while(<>) loop, and I like the flexibility of using them two different ways:
    myscript some.data > output
    # or
    grep -h foo *.data | sort -u | myscript > output
But what if I need to apply binmode() on the input file handle (e.g. because the data needs to be read as utf8)? For stdin-stdout usage, that's no problem -- just:
    binmode STDIN, ":utf8";
But what about for files in @ARGV? I know these get opened via the "magical" ARGV file handle, and I know (from having just tried it) that this does not DWIM:
    #!/usr/bin/perl
    binmode STDIN, ":utf8";  # covers pipe input
    binmode ARGV, ":utf8";   # does not work -- handle isn't open yet
    while (<>) {
        do_whatever( $_ )
    }

I've tried using the '-C' option on the shebang line, but it turns out that -C has problems when there are other option flags on the shebang line. In fact there's a thread at perlbug (ticket #34087, for those keeping score) that indicates this flag is known to be broken and apparently will be phased out. (I first realized the problems when a script using -C, which worked in 5.8.7, failed to work in 5.8.8.)

Note that use encoding "utf8"; only affects STDIN and STDOUT -- no effect on ARGV. I know I can do something like this:

    #!/usr/bin/perl
    use strict;

    my @files;
    if ( @ARGV ) {
        @files = @ARGV;
    }
    elsif ( -t ) {
        die "I want file names to open, or else pipeline input";
    }
    else {
        @files = "stdin";
    }
    for my $file ( @files ) {
        my $fh;
        if ( @ARGV ) {
            open $fh, "<:utf8", $file;
        }
        else {
            binmode STDIN, ":utf8";
            $fh = \*STDIN;
        }
        while (<$fh>) { do_whatever( $_ ) }
    }
But that sucks. I could also give up the convenience of "dual usage" -- e.g. just write scripts to read from STDIN only, and never use the "magical" ARGV file handle -- but that would be sad. Using environment or locale settings would be fairly impractical as well (consecutive command lines might need to use different encodings).

Can someone point out a better way to do this? Or maybe the powers that be could be talked into fixing and keeping the -C option? (I suppose this will be a non-issue when Perl 6 becomes the tool of choice...)

(update: added declaration for $fh in last code snippet, to make it grammatical)

Replies are listed 'Best First'.
Re: Using binmode on ARGV filehandle?
by ruzam (Curate) on Jun 02, 2006 at 22:14 UTC
    Simple. What you want is:
        #!/usr/bin/perl
        use open IN => ":utf8";
        # Now all file handles opened for input (including ARGV)
        # will use utf8 encoding unless told otherwise
        while (<>) {
            do_whatever( $_ );
        }
    If you specifically want binmode (raw bytes, not utf8), use ":raw" instead. The book suggests this may only be workable in newer versions of Perl. Works for me using v5.8.5.
      Aha! This seems to contradict some of the statements made in the older threads that tye had pointed to in his initial reply above. Given that those threads are a few years old now, it would seem that the problems cited back then have been fixed.

      Thanks! This appears to do exactly what I want, in both 5.8.7 and 5.8.8.
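
For anyone who wants to check the "use open" approach on their own Perl, here is a minimal self-contained sketch. It is only an illustration: the temp file and its contents are made up for the demo, and @ARGV is set by hand to stand in for file names given on the command line. It writes one UTF-8 line to a file, reads it back through <>, and reports the line length, which should count characters rather than bytes if the :utf8 layer was applied to ARGV.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use open IN => ':utf8';   # default input layer for handles opened in this scope

# Hypothetical demo data: "caf" plus e-acute, which is 2 bytes in UTF-8.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw';                 # write the bytes exactly as given
print {$tmp} "caf\xC3\xA9\n";
close $tmp or die "close failed: $!";

@ARGV = ($path);          # stand-in for file names on the command line
my @lengths;
while (my $line = <>) {
    chomp $line;
    push @lengths, length $line;      # 4 if decoded as characters, 5 if raw bytes
}
print "@lengths\n";
```

If pipe input needs the same treatment, perldoc open documents a :std subpragma (as in "use open qw(:std :utf8);") that applies the chosen layers to the standard handles as well; note that this form also changes STDOUT and STDERR.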

Re: Using binmode on ARGV filehandle? (open.pm)
by tye (Sage) on Jun 02, 2006 at 21:51 UTC
      Thanks -- those are very helpful links. I found out from reading those that using the PERLIO environment variable does in fact provide the desired effect. For example, this script:
          #!/usr/bin/perl -w
          binmode STDOUT, ":encoding(utf16)";
          while (<>) {
              print;
          }
      will correctly convert utf8 input to utf16 output, whether reading from STDIN or a list of one or more files in @ARGV, when I run it like this:
          $ export PERLIO=:utf8
          $ myscript *.utf8   ## works the same as: cat *.utf8 | myscript
      Alas, if I just put $ENV{PERLIO} = ":utf8"; into the script itself, this doesn't work. So to really free my script from undesirable environment dependencies, I could just make a wrapper script that sets $ENV{PERLIO}, then execs the actual filter script with @ARGV. Not exactly pretty, but not as ugly as other alternatives.

        You could use the same trick we use for fooling dynaloader or Oracle:

            BEGIN {
                if( ! $ENV{PERLIO} ) {
                    $ENV{PERLIO}= ":utf8";
                    exec $^X, $0, @ARGV;
                }
            }

        - tye        

Re: Using binmode on ARGV filehandle?
by ikegami (Patriarch) on Jun 02, 2006 at 22:11 UTC
    Your code can be shrunk to:
        for my $file (@ARGV ? @ARGV : '-') {
            my $fh;
            if ($file eq '-') {
                open($fh, '-');   # "-" doesn't work with 3-arg open.
                binmode($fh, ':utf8');
            }
            else {
                open($fh, '<:utf8', $file);
            }
            while (<$fh>) { do_whatever( $_ ) }
        }

    Differences from using while (<>):

    • eof() doesn't work the same.
    • eof(ARGV) doesn't work.
    • eof might not work the same.
    • $. doesn't work the same.
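
To make those differences concrete, here is a small sketch of what the magic <> loop gives you for free. It uses the idiom from the eof entry in perlfunc; the two temp files and their contents are invented for the demo, and @ARGV is set by hand. A bare eof (no parens) is true at the end of each file, and closing ARGV at that point resets $. so line numbers restart per file. A hand-rolled loop over explicit file handles has to reproduce this bookkeeping itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical demo input: two small files.
my @paths;
for my $text ("a1\na2\n", "b1\n") {
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $text;
    close $fh or die "close failed: $!";
    push @paths, $path;
}

@ARGV = @paths;           # stand-in for file names on the command line
my @seen;
while (my $line = <>) {
    chomp $line;
    push @seen, "$.:$line";
    close ARGV if eof;    # no parens: end of the current file; close resets $.
}
print "@seen\n";          # prints "1:a1 2:a2 1:b1"
```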
Re: Using binmode on ARGV filehandle?
by ikegami (Patriarch) on Jun 02, 2006 at 21:49 UTC

    In a while (<>) loop, "eof" or "eof(ARGV)" can be used to detect the end of each file (while "eof()" will only detect the end of the last file). I'd try:

        my $binmode = 1;
        while (<>) {
            binmode ARGV, ":utf8" if $binmode;
            $binmode = eof;   # Don't add parens!
            do_whatever($_);
        }

    Update: Dang! The binmode comes too late to affect the first line of each file :(