PerlMonks  

Pipe dream

by tlm (Prior)
on Sep 09, 2005 at 01:30 UTC ( [id://490397]=perlmeditation )

How come Unix's piping paradigm didn't make it into Perl? Or maybe it did and I didn't notice?

Yes, I know that one can open pipes like this:

open my $pipe, "foobar|" or die "$!\n";
print frobnicate( $_ ) while <$pipe>;
...but I have in mind something more integrated into Perl than that.

Especially after the introduction of lexical handles, I would like to be able to take a read handle and transform it somehow to modify its output.

For example, suppose the file foo.tsv consists of newline-separated records of tab-delimited fields, and I want to generate a "view" consisting of those records whose first field has the value 42. Furthermore, I only want fields 1, 3, and 8, and I want the resulting records to be sorted lexicographically. Finally, I want to put everything in foo_view.tsv. Easy:

{
    open my $in, 'foo.tsv' or die "$!\n";
    my @records;
    while ( <$in> ) {
        next unless /^42\t/;
        chomp;
        push @records, join( "\t", ( split "\t" )[ 1, 3, 8 ] ) . $/;
    }
    open my $out, '>', 'foo_view.tsv' or die "$!\n";
    print $out $_ for sort @records;
}

But here's a different way to think about this:

{
    open my $in, 'foo.tsv' or die "$!\n";
    $in = Filter::grepit( $in, qr/^42\t/ );
    $in = Filter::cols  ( $in, "\t", 1, 3, 8 );
    $in = Filter::sortit( $in );
    open my $out, '>', 'foo_view.tsv' or die "$!\n";
    print $out $_ while <$in>;
}

The function Filter::grepit takes an open read handle and a regex and returns a read handle that outputs only those records from the original handle that match the regex. The function Filter::cols takes an open read handle, a field delimiter, and a list of field numbers, and returns a read handle whose records consist of only the selected fields. Finally, Filter::sortit returns a read handle that emits the records in lexicographic order.

Admittedly, this code is not more succinct and not much clearer than the first version, though, subjectively, I find it easier on the eye somehow. But the potential big win is that, in principle, to sort the records we no longer have to read them all into a Perl array, which could take up a lot of memory. That problem is relegated to the implementation of sortit. Of course, sortit could end up doing precisely that behind the scenes, but it could also do something else. For example, sortit could fork the job off to sort(1):

sub sortit {
    my $fh = shift;
    return pipeit( $fh, 'sort' );
}

sub pipeit {
    my ( $fh, $cmd ) = @_;
    my $new_fh;

    # Parent: gets a read handle on the child's STDOUT
    return $new_fh if my $pid = open $new_fh, '-|';
    die "Fork failed: $!\n" unless defined $pid;

    # Child: feed everything from $fh into $cmd, whose output goes to
    # our STDOUT and hence back to the parent
    open my $pipe, "|$cmd" or die "Pipe failed: $!\n";
    print $pipe $_ while <$fh>;
    close $pipe or die "Close failed: $!\n";
    exit 0;
}

Now, even for huge files, we can let sort(1) handle the problem of creating intermediate sorted fragments, merging them, etc. I'm sure there are better ways to implement this kind of thing, but you get the idea.
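For what it's worth, the three reassignments of $in compose mechanically. A small helper could thread a handle through a list of filter stages; this is a sketch, and the name pipeline and the wrapper coderefs are made up for this example:

```perl
# Sketch of a generic composition helper (the name "pipeline" and the
# wrapper coderefs are invented here): each stage receives the current
# value and returns a transformed one, so handle-transforming filters
# chain in reading order.
sub pipeline {
    my ( $val, @stages ) = @_;
    $val = $_->( $val ) for @stages;
    return $val;
}

# With the (assumed) Filter::* constructors from the post, the view
# above would become:
# my $in = pipeline( $raw_fh,
#     sub { Filter::grepit( $_[0], qr/^42\t/ ) },
#     sub { Filter::cols  ( $_[0], "\t", 1, 3, 8 ) },
#     sub { Filter::sortit( $_[0] ) },
# );
```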

Does anything like this already exist on CPAN? (The closest I've found is PerlIO layers, which I find pretty hard to use.)
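For the record, a minimal PerlIO::via layer is less work than the full layer API suggests. Here is a sketch, with the layer name Grep42 invented for this example, that filters a read handle down to lines matching /^42\t/:

```perl
# Sketch of a PerlIO::via layer (core since perl 5.8). The layer name
# Grep42 is made up; only lines matching /^42\t/ pass through.
package PerlIO::via::Grep42;

# Mark the package as "loaded" so that :via() does not try to require
# a file for it when the package is defined inline like this.
BEGIN { $INC{'PerlIO/via/Grep42.pm'} = __FILE__ }

sub PUSHED {
    my ( $class, $mode, $fh ) = @_;
    return bless {}, $class;
}

# FILL is called whenever the layer above needs more data;
# returning undef signals EOF.
sub FILL {
    my ( $self, $fh ) = @_;
    while ( defined( my $line = <$fh> ) ) {
        return $line if $line =~ /^42\t/;
    }
    return undef;
}

package main;

# Usage, with foo.tsv as in the post:
# open my $in, '<:via(Grep42)', 'foo.tsv' or die "$!\n";
# print while <$in>;
```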


PS: FWIW, here are implementations of grepit and cols:

sub grepit {
    my ( $fh, $keep ) = @_;
    my $new_fh;
    return $new_fh if my $pid = open $new_fh, '-|';
    die "Fork failed: $!\n" unless defined $pid;
    my $re = ref $keep ? $keep : qr/\Q$keep/;
    /$re/ && print STDOUT while <$fh>;
    exit 0;
}

sub cols {
    my ( $fh, $sep, @cols ) = @_;
    my $new_fh;
    return $new_fh if my $pid = open $new_fh, '-|';
    die "Fork failed: $!\n" unless defined $pid;
    while ( <$fh> ) {
        chomp;    # avoid a stray newline on the last selected field
        print STDOUT join( $sep, ( split $sep )[ @cols ] ), "\n";
    }
    exit 0;
}

the lowliest monk

Replies are listed 'Best First'.
Re: Pipe dream
by tilly (Archbishop) on Sep 09, 2005 at 03:57 UTC
    In Perl 6 your operator will exist and be known as ==>.
Re: Pipe dream
by jdporter (Paladin) on Sep 09, 2005 at 03:07 UTC
    I've often reflected on the fact that perl's (and other languages') equivalent of a pipe stream is stacked functions which take and return lists. (A canonical example can be found in the Schwartzian Transform.) The equivalence is even stronger when lists can be lazy. Languages like Haskell do it beautifully. Presumably Perl6 will too.
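The OP's view, written as one stack of list functions for comparison. This is only a sketch, wrapped in a sub so the source handle is explicit; it holds everything in memory at once, which is exactly the trade-off the OP wants to avoid, but it reads like a pipeline bottom-up:

```perl
# Sketch: the foo.tsv view as stacked list functions, read bottom-up:
# lines -> grep first field 42 -> chomp -> pick fields 1,3,8 -> sort.
# All records are in memory at once.
sub view_lines {
    my $in = shift;
    return sort
           map  { join( "\t", ( split /\t/ )[ 1, 3, 8 ] ) . "\n" }
           map  { chomp; $_ }
           grep { /^42\t/ }
           <$in>;
}

# open my $in, '<', 'foo.tsv' or die "$!\n";
# print view_lines($in);
```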
Re: Pipe dream
by Zaxo (Archbishop) on Sep 09, 2005 at 10:09 UTC

    There is pipe. Pipe is lower level than shell pipes, so it is more flexible. It is possible to set up arbitrary networks of processes with pipe, fork, and select. The complexity of such things discourages their use, but that is not a limitation of Perl.

    It is pretty easy to set up a coprocess which you can print stuff to, and read back the result. Skeletal version:

    my ($pin, $pout, $cin, $cout, %kid);
    {
        pipe $pin, $cout;
        pipe $cin, $pout;
        my $cpid = fork;
        die $! unless defined $cpid;
        $kid{$cpid} = 1, last if $cpid;   # parent
        close $pin  or die $!;            # in child to end of block
        close $pout or die $!;
        while (<$cin>) {
            # do filter-like stuff to $_
            print $cout $_;
        }
        exit 0;
    }
    # parent
    close $cin  or die $!;
    close $cout or die $!;
    # print to $pout and read from $pin, maybe in a select loop,
    # depending on the expected behavior of the child.
    delete $kid{wait()} while %kid;
    The complication of that could be wrapped in a module, and probably has been.

    After Compline,
    Zaxo

Re: Pipe dream
by chb (Deacon) on Sep 09, 2005 at 08:25 UTC
    Please have a look at Higher Order Perl by Dominus. Your pipes are called streams there and are implemented with infinite lists. They are not limited to filehandles as data source. Very interesting stuff.

      I'm quite familiar with streams, and with HOP's treatment of them. I think it would be exceedingly awkward to implement the example in my OP using streams. (For one thing, since I am sorting in the last step, I have to fully "actualize" each stream, which means that they become garden-variety (finite) linked lists, each one of them resident in memory.) What I'm talking about has much more to do with iterators, which Dominus also covers extensively in HOP.

      the lowliest monk

        Well, you can't sort infinite streams. I haven't read HOP, but usually streams and iterators are the same thing. Lazy lists, on the other hand, are nothing but memoized iterators. You can sort iterators without storing the entire thing in memory, but it requires multiple passes through the contents of the iterator (N passes for the bubble sort, although you can trade this off for increased memory usage. That is, you can do it in N/2 passes if your bubble sort floats two items to the top at a time).
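The multi-pass idea can be sketched with coderef iterators. Here, bubble_pass is a hypothetical helper that performs one lazy bubble pass; each pass holds only a single element at a time, and chaining N-1 passes yields fully sorted output without ever materializing the whole list in one place:

```perl
# Sketch: one lazy bubble pass over an iterator (a coderef that
# returns undef when exhausted). The pass carries forward the larger
# of each adjacent pair, so the maximum "bubbles" to the end; chaining
# N-1 passes sorts N elements, holding one element per pass.
sub bubble_pass {
    my $it = shift;
    my $held;    # larger element of each adjacent pair, carried forward
    my $done;
    return sub {
        return undef if $done;
        while ( defined( my $cur = $it->() ) ) {
            if ( !defined $held ) { $held = $cur; next; }
            if ( $held le $cur ) {    # held is smaller: emit it
                my $out = $held;
                $held = $cur;
                return $out;
            }
            return $cur;              # cur is smaller: emit it, keep held
        }
        $done = 1;
        return $held;                 # largest element seen this pass
    };
}
```

Using string comparison (le) to match the lexicographic sorting discussed in the thread; swap in a comparator as needed.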
Re: Pipe dream
by ambrus (Abbot) on Sep 09, 2005 at 11:07 UTC

    You do know about open("-|") and open("|-"), don't you?

    However, I don't like to pipe data through real pipes between processes. I prefer real coroutines that pass values to each other, not just text through a pipe. Perl is not really good at this, because it doesn't have continuations or coroutines, and it's very difficult to add them to the perl core now. If you want that, you can use one of:

    • Python, which has coroutines and transparent lazy lists based on them;
    • Perl 6, which will have them too, but it's not ready yet;
    • Scheme, which has continuations built in, and some Scheme implementations and libraries have even more support;
    • Ruby, which has a callback-based iterator model in its core classes, plus continuations and some libraries based on them;
    • Lua, which has coroutines.
Re: Pipe dream
by Roy Johnson (Monsignor) on Sep 09, 2005 at 14:32 UTC
    As you mention, it should be doable with iterators rather than requiring interprocess pipes. Some idle doodling came up with a syntax that appeals to me and might appeal to you.
    # chain_lines takes a list of coderefs.
    # The first sub in the chain is called with no input; subsequent ones are
    # called with $_ set to the line from the previous sub; the iterator
    # returns one line of output from the last sub in the chain. If a sub
    # yields undef, processing restarts at the first sub. When the first sub
    # yields undef, the iterator is done.
    my $i = chain_lines
        sub { <$in> },
        sub { /^42/ ? $_ : undef },                 # grep
        sub { join "\t", (split /\t/)[1,3,8] };

    # all_lines expands the iterator into a list of lines
    sort all_lines($i);
    The grep case might warrant its own special syntax, like
    grep_lines {/^42/}

    Caution: Contents may have been coded under pressure.
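A hedged sketch of chain_lines and all_lines as specified in the comment above; the names and calling convention come from that comment, and the implementation here is one plausible reading of it, not the original doodle:

```perl
# Sketch of chain_lines as specified above: subs after the first see
# the previous sub's output in $_; an undef from any sub restarts at
# the first sub; undef from the first sub ends the iterator.
sub chain_lines {
    my ( $first, @rest ) = @_;
    return sub {
        local $_;
        LINE: while ( defined( $_ = $first->() ) ) {
            for my $s (@rest) {
                my $out = $s->();
                defined $out or next LINE;    # restart at the source
                $_ = $out;
            }
            return $_;
        }
        return undef;                          # source exhausted
    };
}

# all_lines drains the iterator into a list
sub all_lines {
    my $it = shift;
    my ( @lines, $l );
    push @lines, $l while defined( $l = $it->() );
    return @lines;
}
```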

      If you're going to write a grep iterator, have the return value be an empty list on element failure. This allows failures to be removed from the list instead of just being substituted with undefined values. In fact, grep is just a special case of map.

      grep /^42/
      map { /^42/ ? $_ : () }
      sub { /^42/ ? $_ : () }
        undef is a special value in the scheme I've proposed. All results are taken in scalar context (assigned to $_), since this is a line-piping scheme. If the grep step returns undef, the iterator goes back to the first sub to try a new line. undefs do not show up in the output stream.

        You could use an empty list as your no-line-returned indicator, but that would require evaluating the subs in list context and then turning the returned value into a scalar afterward. That would be problematic for some common pipe actions, like reading one line at a time from a file (in list context, the whole file would be read).


        Caution: Contents may have been coded under pressure.
Re: Pipe dream
by ruoso (Curate) on Sep 09, 2005 at 18:29 UTC
    I recently ran into a similar problem. I needed to export all the data from an mdb (MS Access) file, but I don't use Windows and don't have Access. I found the mdbtools software, which knows how to list the tables of a file and how to export a table. So my idea was to run
    mdb-export -d , file.mdb tablename | gzip -9 -c > file.csv.gz
    for each table of each file. But some tables had spaces in their names, which made this hard to do in plain shell, because the shell split the table name into separate words. So I decided to do it in Perl, and ended up with the following code.
    use strict;
    use IPC::Open2;

    opendir DIR, "." or die $!;
    my @bancos = grep { /\.mdb$/ } readdir DIR;
    closedir DIR;

    foreach my $db (@bancos) {
        print "$db\n";
        open TABLES, "mdb-tables -1 $db |" or die $!;
        my @tables = <TABLES>;
        close TABLES;

        my $dbdir = $db;
        $dbdir =~ s/\.mdb$//;
        mkdir $dbdir or die $!;

        foreach my $table (@tables) {
            chomp $table;
            open MDBEXPOUT, "mdb-export -d , '$db' '$table' |" or die $!;
            $table =~ s/\W/_/g;
            open OUTFILE, ">$dbdir/$table.csv.gz" or die $!;
            open2( '>&OUTFILE', '<&MDBEXPOUT', 'gzip', '-9', '-c' ) or die $!;
            wait;
            close MDBEXPOUT;
            close OUTFILE;
            print "$dbdir/$table\n";
        }
    }
    This code uses pipes just as the shell would... I was thinking of creating a module to make it easier to chain pipes like this, but, hmmm... I'm not sure it's a good idea...
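FWIW, the spaces-in-table-names problem goes away entirely with list-form pipe opens (perl 5.8+, not available on Windows), which bypass the shell so no quoting is needed. A sketch, with the helper name read_cmd made up for this example:

```perl
# Sketch: list-form open runs the command directly, without a shell,
# so arguments containing spaces need no quoting (perl 5.8+; the
# helper name read_cmd is invented here).
sub read_cmd {
    my @cmd = @_;
    open my $fh, '-|', @cmd or die "Can't run '@cmd': $!\n";
    return $fh;
}

# e.g. my $fh = read_cmd( 'mdb-export', '-d', ',', $db, $table );
```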
    daniel
