in reply to Filtering passwords from the output of a script

This is not supposed to be obfuscated...

I understand, and the OP code (once I grokked it) actually struck me as more clever than obfuscated. Nice, even.

But if you really want completely unbuffered output to the log file and anywhere else, you have to use syswrite and sysread, and you have to use them exclusively. The standard I/O methods (i.e. print and <FILEHANDLE>) are intrinsically line-oriented --

Well, actually, print and the diamond operator are record-oriented: the input record separator $/ controls where <FILEHANDLE> splits its input, and the output record separator $\ is appended to each print. So you could try playing with those -- but that won't be any less obfuscated than just using sysread and syswrite.
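A minimal sketch of what "playing with" $/ looks like, assuming a made-up sample string read through an in-memory filehandle (the variable names are illustrative, not from the OP code). It only shows that the diamond operator splits on whatever $/ holds:

```perl
use strict;
use warnings;

# Hypothetical sample text, read through an in-memory filehandle.
my $text = "one two three\nfour five\n";

# Default: $/ is "\n", so each read returns one line.
open my $by_line, '<', \$text or die "open failed: $!";
my @lines = <$by_line>;
close $by_line;

# With $/ set to a single space, each read returns one space-terminated record.
my @words;
{
    local $/ = " ";    # localized so the change does not leak out of this block
    open my $by_word, '<', \$text or die "open failed: $!";
    @words = <$by_word>;
    close $by_word;
}

printf "%d records with the default separator\n", scalar @lines;
printf "%d records with a space as the separator\n", scalar @words;
```

Note that a newline is no longer special once $/ is a space, so "three\nfour " comes back as a single record.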

I had to play with it a bit, but the following version of the OP code does what I think you want in terms of making sure that output passes through the pipe and into the log file with the shortest possible buffering delay.

(Note that you cannot avoid doing word-level buffering, because without that, you wouldn't be able to use s/// to remove the password string. Also, a lot of stuff in the "internal" script needs to be escaped in order to work right -- I was actually puzzled that "\s" had to be "\\s" in the "if" clause, but doing "\\Q" and "\\E" was apparently unnecessary.)

use warnings;
use strict;

my $password = "hide_me";
my $log_file = "test.log";

open STDOUT, "| perl | tee -a $log_file";

my $script = <<_END_FILTER_SCRIPT_;
use warnings;
use strict;
my \$chr;
\$_ = '';
while (sysread DATA, \$chr, 1) {
    \$_ .= \$chr;
    if ( /\\s\$/ ) {
        s/\Q$password\E/removed/ig;
        syswrite STDOUT, \$_;
        \$_ = '';
    }
}
close STDOUT;
__DATA__
_END_FILTER_SCRIPT_

syswrite STDOUT, $script;
sleep 1;  ## had to add this delay so next syswrite would always work
syswrite STDOUT, "The password is $password\n";
foreach (qw{Output is unbuffered if these words are printed one at a time.}) {
    syswrite STDOUT, $_." ";
    sleep 1;
}
syswrite STDOUT, "\n";
sleep 1;
syswrite STDOUT, "Output is line buffered if the last line came out all at once.\n";
close STDOUT;

I agree that having to stuff all those backslashes into the internal script is really ugly, so maybe you'll want to externalize that part of the code into a separate script file, as suggested in the first reply. (But that means storing your password string in yet another file, so as not to expose it in the system's process table.)

(update: removed unnecessary "binmode" calls)

Re^2: Filtering passwords from the output of a script
by quester (Vicar) on Nov 29, 2006 at 07:48 UTC
    graff and ikegami,

    Thanks++ to both of you. After getting a lot of inspiration from all your posts I tracked the problem down to the diamond operator, "<DATA>". A working, although possibly nonportable, version is at the bottom of this post.

    In my current environment (Linux 2.6, Perl 5.8) it isn't necessary to use syswrite; ordinary print statements work fine after setting autoflush ($| = 1). This is very fortunate because I would like to stay away from modifying the guts of Perl Expect.

    The sleep statement that graff put in actually works like this: Perl is already trying to do buffered input on DATA in the inner script, and it tries to grab some of the input from the pipe. It hangs on to it when sysread is called. (That baffled me for a while, but putting a "print <DATA>" at the end of the script printed the missing first lines of input.) When I changed the sysread to read it worked reliably... on my current platform. A simple getc also works... again, on my current platform.

    If anyone can explain the real difference between Perl's sysread and read I'm all ears. What I found refers me to the documentation on the Unix functions read and fread, which look almost the same. Is using read or getc in this script likely to break if it's ported to a different operating system?

    If I do need to use sysread to avoid portability issues, I'm very much loath to put a sleep into production code. I tried select undef, undef, undef, 0.0001; on my laptop and it was long enough sometimes and not others. Ten microseconds was never enough and one millisecond was always enough... tonight, with nothing else running. My experience has been that the length of the sleep needed will vary by enormous factors depending on the details of the system load. I suppose that if it were really necessary, a file could be used as an "I am ready now" flag between the two processes.

    Thanks again for the suggestions about too many backslashes (I changed from a here document to single quotes) and the trick of waiting for whitespace to check for a password that arrives in pieces - I had missed that one.

    Sorry if I'm getting long winded. The big question: is getc a good way of doing this or is it nonportable?

    The current version of the code (with getc) looks like this.

    use warnings;
    use strict;

    my $password = "hide_me";
    my $log_file = "test.log";

    open STDOUT, "| perl | tee -a $log_file";
    select STDOUT; $| = 1;

    print '
    use warnings;
    use strict;
    select STDOUT; $| = 1;
    my $chr;
    $_ = "";
    while ($chr = getc DATA) {
        $_ .= $chr;
        next unless $chr =~ /\s/;
        s/' . (quotemeta $password) . '/removed/ig;
        print;
        $_ = "";
    }
    print;
    __DATA__
    ';

    print "The password is $password\n";
    foreach (qw{Output is unbuffered if these words are printed one at a time.}) {
        print $_, " ";
        select undef, undef, undef, 0.3;
    }
    print "\n";
    foreach (split //, "Even one character at a time: $password should be filtered.\n") {
        print;
        select undef, undef, undef, 0.1;
    }
    close STDOUT;

    The output looks like this. The second and third lines are both printed a word at a time.

    The password is removed
    Output is unbuffered if these words are printed one at a time.
    Even one character at a time: removed should be filtered.

      The read and getc Perl functions correspond to the fread and fgetc C-lib functions. The C-lib I/O functions are buffered. When you ask for X bytes, it might actually read X+Y bytes internally. Subsequent reads will read from the Y bytes first.

      This is very good in your case, because you keep asking for one byte. Most of the time, you'll just be reading from the buffer, which is faster than doing a real read.

      How do the C-lib functions get their data? Through system calls. sysread is Perl's interface to the read system call. The system I/O functions are not buffered.

      A key difference between read and sysread is that read(FH, $buf, $bytes) will wait for $bytes bytes to be available, whereas sysread(FH, $buf, $bytes) will return as soon as bytes become available. $bytes is simply a maximum for sysread. I took advantage of this to read in more than one byte at a time.

      C-lib and system functions should not both be used on the same file handle.
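A minimal sketch of the sysread behaviour described above, assuming a pipe created inside a single process (the handle names are illustrative). The request is for up to 4096 bytes, but sysread comes back with only what is already in the pipe:

```perl
use strict;
use warnings;

# Create a pipe inside one process so the example needs nothing else to run.
pipe(my $reader, my $writer) or die "pipe failed: $!";

# Put five bytes into the pipe with an unbuffered write.
defined(syswrite($writer, "hello")) or die "syswrite failed: $!";

# Ask for up to 4096 bytes. sysread returns as soon as some data is available,
# so it comes back with the five bytes instead of waiting for the full amount.
my $buf = '';
my $got = sysread($reader, $buf, 4096);
die "sysread failed: $!" unless defined $got;

print "asked for 4096 bytes, got $got: '$buf'\n";

# A buffered read($reader, $buf, 4096) at this point would instead block
# until 4096 bytes had arrived or the write end had been closed.
close $writer;
close $reader;
```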


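      Your usage of getc is incorrect. It returns undef at the end of input, so the loop should test with defined; a plain truth test also ends the loop on the character "0", which is false in Perl.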
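A minimal sketch of the difference, assuming a made-up three-character input read through in-memory filehandles (nothing here comes from the thread's script). The first loop uses a truth test, the second uses defined:

```perl
use strict;
use warnings;

# Hypothetical input containing the character "0", which is a false value in Perl.
my $input = "a0b";

# Truth test: the loop ends early when getc returns the string "0".
open my $fh_truth, '<', \$input or die "open failed: $!";
my $seen_truth = '';
while (my $chr = getc($fh_truth)) {
    $seen_truth .= $chr;
}
close $fh_truth;

# defined test: the loop ends only when getc returns undef at end of input.
open my $fh_defined, '<', \$input or die "open failed: $!";
my $seen_defined = '';
while (defined(my $chr = getc($fh_defined))) {
    $seen_defined .= $chr;
}
close $fh_defined;

print "truth test read:   '$seen_truth'\n";
print "defined test read: '$seen_defined'\n";
```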

      The select STDOUT; is useless.

      print '
      use warnings;
      use strict;
      $| = 1;
      my $chr;    # declared so the script compiles under "use strict"
      $_ = "";
      while (defined($chr = getc DATA)) {
          $_ .= $chr;
          next unless $chr =~ /\s/;
          s/' . (quotemeta $password) . '/removed/ig;
          print;
          $_ = "";
      }
      print;
      __DATA__
      ';

      (Update: Never mind, next unless /\s/; is buggy.

      Probably faster:

      print '
      use warnings;
      use strict;
      $| = 1;
      $_ = "";
      for (;;) {
          sysread(DATA, $_, 4096, length()) or last;
          next unless /\s/;
          s/' . (quotemeta $password) . '/removed/ig;
          print;
          $_ = "";
      }
      print;
      __DATA__
      ';

      )

        Thanks again. That's a good point about changing while ($chr...) to while (defined ($chr...)). Without it, a "0" character will cause the logging subprocess to exit, which would be very sad.

        Two minor notes about the sysread code: one would be to add some synchronizing code so the parent process can wait until the first sysread is in progress. That way Perl won't eat any of the parent's output into its buffers for <DATA>. The second would be to change next unless /\s/ to next unless /\s$/ so that a burst of characters that contains a space followed by the start of the password won't be printed before the password can be removed.
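A minimal sketch of the second note, assuming made-up burst boundaries that split the password across two reads. The filter_chunks helper is hypothetical and only mimics the buffering logic; it is not the sysread loop itself:

```perl
use strict;
use warnings;

my $password = "hide_me";    # same sample password as in the thread

# Hypothetical helper: append each chunk to a buffer, and flush the buffer
# (after filtering) whenever it matches $flush_re.
sub filter_chunks {
    my ($flush_re, @chunks) = @_;
    my ($buf, $out) = ('', '');
    for my $chunk (@chunks) {
        $buf .= $chunk;
        next unless $buf =~ $flush_re;
        $buf =~ s/\Q$password\E/removed/ig;
        $out .= $buf;
        $buf = '';
    }
    # Filter and flush whatever is left at end of input.
    $buf =~ s/\Q$password\E/removed/ig;
    return $out . $buf;
}

# Assumed burst boundaries: the password arrives split across two reads.
my @chunks = ("login ok hide", "_me done\n");

my $loose  = filter_chunks(qr/\s/,  @chunks);    # flush if any whitespace was seen
my $strict = filter_chunks(qr/\s$/, @chunks);    # flush only if the buffer ends in whitespace

print "any whitespace:      $loose";
print "trailing whitespace: $strict";
```

With the looser test the first burst is flushed as soon as it contains a space, so the two halves of the password reach the log unfiltered; with the trailing-whitespace test the buffer is held until the whole word has arrived.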

Re^2: Filtering passwords from the output of a script
by ikegami (Patriarch) on Nov 29, 2006 at 06:38 UTC
    Three issues:
    • It is word-buffered. It will only output when whitespace is received.
    • It won't always work if the password has a space in it.
    • I imagine that reading a byte at a time (when more than a byte is available) is much slower than reading a chunk of bytes.

    See Re^3: Filtering passwords from the output of a script.

    By the way, binmode might be required. I remember having problems when attempting to read characters (as opposed to bytes) using sysread, but the details escape me.

      Good points, although /\s/ actually also matches newline and carriage return. As a sheer stroke of fortune, in my current application a restriction against using spaces or non-ASCII characters in passwords is appropriate. Also, I don't have to worry too much about efficiency; my logs typically arrive at only a few hundred bytes per second. Thanks again!