
How come Unix's piping paradigm didn't make it into Perl? Or maybe it did and I didn't notice?

Yes, I know that one can open pipes like this:

open my $pipe, "foobar|" or die "$!\n";
print frobnicate( $_ ) while <$pipe>;

...but I have in mind something more integrated into Perl than that.

Especially after the introduction of lexical handles, I would like to be able to take a read handle and transform it somehow to modify its output.

For example, suppose the file foo.tsv consists of newline-separated records of tab-delimited fields, and I want to generate a "view" consisting of those records whose first field has the value 42. Furthermore, I only want fields 1, 3, and 8, and I want the resulting records to be sorted lexicographically. Finally, I want to put everything in foo_view.tsv. Easy:

{
  open my $in, 'foo.tsv' or die "$!\n";
  my @records;
  while ( <$in> ) {
    next unless /^42\t/;
    chomp;
    push @records, join( "\t", ( split "\t" )[ 1, 3, 8 ] ) . $/;
  }

  open my $out, '>', 'foo_view.tsv' or die "$!\n";
  print $out $_ for sort @records;
}

But here's a different way to think about this:

{
  open my $in, 'foo.tsv' or die "$!\n";

  $in = Filter::grepit( $in, qr/^42\t/ );
  $in = Filter::cols  ( $in, "\t", 1, 3, 8 );
  $in = Filter::sortit( $in );

  open my $out, '>', 'foo_view.tsv' or die "$!\n";
  print $out $_ while <$in>;
}

The function Filter::grepit takes an open read handle and a regex and returns a read handle that outputs only those records from the original handle that match the regex. The function Filter::cols takes an open read handle, a field delimiter, and a list of field numbers, and returns a read handle whose records consist of only those fields. Finally, Filter::sortit takes a read handle and returns a read handle that yields the records in lexicographic order.
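To make the composition concrete without any forking, here is a minimal in-process sketch of the same three filters written as closures that return one record per call. The names iter_fh, igrep, icols, and isort are made up for this illustration; they are not from any CPAN module, and this is not how the Filter:: functions in this post are implemented.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration only.

# Wrap a read handle in a closure that returns one record per call,
# or undef at end of input.
sub iter_fh {
    my ($fh) = @_;
    return sub { scalar readline($fh) };
}

# Pass through only the records that match $re.
sub igrep {
    my ( $it, $re ) = @_;
    return sub {
        while ( defined( my $rec = $it->() ) ) {
            return $rec if $rec =~ $re;
        }
        return;
    };
}

# Keep only the given zero-based fields of each record.
sub icols {
    my ( $it, $sep, @cols ) = @_;
    return sub {
        defined( my $rec = $it->() ) or return;
        chomp $rec;
        return join( $sep, ( split /\Q$sep\E/, $rec, -1 )[@cols] ) . "\n";
    };
}

# Sorting cannot stream: this version slurps everything on the first call.
sub isort {
    my ($it) = @_;
    my $sorted;
    return sub {
        unless ($sorted) {
            my @all;
            while ( defined( my $rec = $it->() ) ) { push @all, $rec }
            $sorted = [ sort @all ];
        }
        return shift @$sorted;
    };
}

# Usage on a small in-memory "file".
my $data = "42\tb\tz\n7\tx\ty\n42\ta\tq\n";
open my $in, '<', \$data or die "$!\n";
my $it = isort( icols( igrep( iter_fh($in), qr/^42\t/ ), "\t", 1, 2 ) );
while ( defined( my $rec = $it->() ) ) { print $rec }
```

The trade-off is that the result is a code reference rather than a real handle, so it cannot be passed to anything that expects to call readline on it.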

Admittedly, this code is neither more succinct nor much clearer than the first version, though, subjectively, I find it easier on the eye. The potential big win is that, in principle, sorting the records no longer requires reading all of them into a Perl array, which could take up a lot of memory. That problem is delegated to the implementation of sortit. Of course, sortit could end up doing precisely that behind the scenes, but it could do something else. For example, sortit could fork the job off to sort(1):

sub sortit {
  my ( $fh ) = @_;
  return pipeit( $fh, 'sort' );
}

sub pipeit {
  my ( $fh, $cmd ) = @_;
  my $new_fh;
  return $new_fh if my $pid = open $new_fh, '-|';
  die "Fork failed: $!\n" unless defined $pid;
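  open my $pipe, "|$cmd" or die "Pipe failed: $!\n";
  print $pipe $_ while <$fh>;
  # Closing a pipe handle waits for $cmd, so its output is complete before we exit.
  close $pipe or die "$cmd failed: ", ( $! || "exit status $?" ), "\n";
  exit 0;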
}

Now, even for huge files, we can let sort(1) handle the problem of creating intermediate sorted fragments, merging them, etc. I'm sure there are better ways to implement this kind of thing, but you get the idea.
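As a side note, when the data to be sorted is already a plain file on disk, the fork-and-copy step is unnecessary: the list form of a piped open can hand the file straight to sort(1). The sketch below assumes a sort binary on the PATH; sorted_fh is a hypothetical name, and the temporary file only stands in for real data.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return a read handle on the output of sort(1)
# run directly over the file at $path.
sub sorted_fh {
    my ($path) = @_;
    local $ENV{LC_ALL} = 'C';    # byte-wise order, independent of the user's locale
    open my $fh, '-|', 'sort', '--', $path
        or die "Cannot run sort: $!\n";
    return $fh;
}

# Stand-in data for the demonstration.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} "pear\napple\nfig\n";
close $tmp or die "$!\n";

my $fh    = sorted_fh($path);
my @lines = <$fh>;
close $fh or die "sort failed: exit status $?\n";
print @lines;
```

Because the list form bypasses the shell, the file name needs no quoting. This only covers the case where sorting is the first stage; after other filters, something like pipeit is still needed.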

Does anything like this already exist in CPAN? (The closest I've found is PerlIO layers, which I find pretty hard to use.)

PS: FWIW, here are implementations of grepit and cols:

sub grepit {
  my ( $fh, $keep ) = @_;
  my $new_fh;
  return $new_fh if my $pid = open $new_fh, '-|';
  die "Fork failed: $!\n" unless defined $pid;
  my $re = ref $keep ? $keep : qr/\Q$keep/;
  /$re/ && print STDOUT while <$fh>;
  exit 0;
}

sub cols {
  my ( $fh, $sep, @cols ) = @_;
  my $new_fh;
  return $new_fh if my $pid = open $new_fh, '-|';
  die "Fork failed: $!\n" unless defined $pid;
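  while ( <$fh> ) {
    chomp;  # otherwise a selected last field would carry its own newline
    # \Q treats the separator literally, as grepit does for plain strings
    print STDOUT join( $sep, ( split /\Q$sep\E/ )[ @cols ] ), "\n";
  }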
  exit 0;
}

the lowliest monk

In reply to Pipe dream by tlm
