PerlMonks  

Pipe dream

by tlm (Prior)
on Sep 09, 2005 at 01:30 UTC ( [id://490397]=perlmeditation )

How come Unix's piping paradigm didn't make it into Perl? Or maybe it did and I didn't notice?

Yes, I know that one can open pipes like this:

open my $pipe, "foobar|" or die "$!\n";
print frobnicate( $_ ) while <$pipe>;
...but I have in mind something more integrated into Perl than that.

Especially after the introduction of lexical handles, I would like to be able to take a read handle and transform it somehow to modify its output.

For example, suppose the file foo.tsv consists of newline-separated records of tab-delimited fields, and I want to generate a "view" consisting of those records whose first field has the value 42. Furthermore, I only want fields 1, 3, and 8, and I want the resulting records to be sorted lexicographically. Finally, I want to put everything in foo_view.tsv. Easy:

{
    open my $in, 'foo.tsv' or die "$!\n";
    my @records;
    while ( <$in> ) {
        next unless /^42\t/;
        chomp;
        push @records, join( "\t", ( split "\t" )[ 1, 3, 8 ] ) . $/;
    }
    open my $out, '>', 'foo_view.tsv' or die "$!\n";
    print $out $_ for sort @records;
}

But here's a different way to think about this:

{
    open my $in, 'foo.tsv' or die "$!\n";
    $in = Filter::grepit( $in, qr/^42\t/ );
    $in = Filter::cols  ( $in, "\t", 1, 3, 8 );
    $in = Filter::sortit( $in );
    open my $out, '>', 'foo_view.tsv' or die "$!\n";
    print $out $_ while <$in>;
}

The function Filter::grepit takes an open read handle and a regex and returns a read handle that outputs only those records from the original handle that match the regex. The function Filter::cols takes an open read handle, a field delimiter, and a list of field numbers, and returns a read handle whose records consist of only the selected fields. Finally, Filter::sortit returns a read handle that emits the records in lexicographic order.

Admittedly, this code is not more succinct and not much clearer than the first version, though, subjectively, I find it easier on the eye somehow. But the potential big win is that, in principle, to sort the records we no longer have to read them all into a Perl array, which could take up a lot of memory. That problem is relegated to the implementation of sortit. Of course, sortit could end up doing precisely that behind the scenes, but it could also do something else. For example, sortit could fork the job off to sort(1):

sub sortit {
    my $fh = shift;
    return pipeit( $fh, 'sort' );
}

sub pipeit {
    my ( $fh, $cmd ) = @_;
    my $new_fh;

    # Parent: gets a read handle on the child's STDOUT
    return $new_fh if my $pid = open $new_fh, '-|';
    die "Fork failed: $!\n" unless defined $pid;

    # Child: feed everything from $fh into $cmd, whose output goes to
    # our STDOUT and hence back to the parent
    open my $pipe, "|$cmd" or die "Pipe failed: $!\n";
    print $pipe $_ while <$fh>;
    close $pipe or die "Close failed: $!\n";
    exit 0;
}

Now, even for huge files, we can let sort(1) handle the problem of creating intermediate sorted fragments, merging them, etc. I'm sure there are better ways to implement this kind of thing, but you get the idea.
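For what it's worth, the three reassignments of $in compose mechanically. A small helper could thread a handle through a list of filter stages; this is a sketch, and the name pipeline and the wrapper coderefs are made up for this example:

```perl
# Sketch of a generic composition helper (the name "pipeline" and the
# wrapper coderefs are invented here): each stage receives the current
# value and returns a transformed one, so handle-transforming filters
# chain in reading order.
sub pipeline {
    my ( $val, @stages ) = @_;
    $val = $_->( $val ) for @stages;
    return $val;
}

# With the (assumed) Filter::* constructors from the post, the view
# above would become:
# my $in = pipeline( $raw_fh,
#     sub { Filter::grepit( $_[0], qr/^42\t/ ) },
#     sub { Filter::cols  ( $_[0], "\t", 1, 3, 8 ) },
#     sub { Filter::sortit( $_[0] ) },
# );
```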

Does anything like this already exist on CPAN? (The closest I've found is PerlIO layers, which I find pretty hard to use.)
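For the record, a minimal PerlIO::via layer is less work than the full layer API suggests. Here is a sketch, with the layer name Grep42 invented for this example, that filters a read handle down to lines matching /^42\t/:

```perl
# Sketch of a PerlIO::via layer (core since perl 5.8). The layer name
# Grep42 is made up; only lines matching /^42\t/ pass through.
package PerlIO::via::Grep42;

# Mark the package as "loaded" so that :via() does not try to require
# a file for it when the package is defined inline like this.
BEGIN { $INC{'PerlIO/via/Grep42.pm'} = __FILE__ }

sub PUSHED {
    my ( $class, $mode, $fh ) = @_;
    return bless {}, $class;
}

# FILL is called whenever the layer above needs more data;
# returning undef signals EOF.
sub FILL {
    my ( $self, $fh ) = @_;
    while ( defined( my $line = <$fh> ) ) {
        return $line if $line =~ /^42\t/;
    }
    return undef;
}

package main;

# Usage, with foo.tsv as in the post:
# open my $in, '<:via(Grep42)', 'foo.tsv' or die "$!\n";
# print while <$in>;
```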


PS: FWIW, here are implementations of grepit and cols:

sub grepit {
    my ( $fh, $keep ) = @_;
    my $new_fh;
    return $new_fh if my $pid = open $new_fh, '-|';
    die "Fork failed: $!\n" unless defined $pid;
    my $re = ref $keep ? $keep : qr/\Q$keep/;
    /$re/ && print STDOUT while <$fh>;
    exit 0;
}

sub cols {
    my ( $fh, $sep, @cols ) = @_;
    my $new_fh;
    return $new_fh if my $pid = open $new_fh, '-|';
    die "Fork failed: $!\n" unless defined $pid;
    while ( <$fh> ) {
        chomp;    # avoid a stray newline on the last selected field
        print STDOUT join( $sep, ( split $sep )[ @cols ] ), "\n";
    }
    exit 0;
}

the lowliest monk

Replies are listed 'Best First'.
Re: Pipe dream
by tilly (Archbishop) on Sep 09, 2005 at 03:57 UTC
    In Perl 6 your operator will exist and be known as ==>.
Re: Pipe dream
by jdporter (Paladin) on Sep 09, 2005 at 03:07 UTC
    I've often reflected on the fact that perl's (and other languages') equivalent of a pipe stream is stacked functions which take and return lists. (A canonical example can be found in the Schwartzian Transform.) The equivalence is even stronger when lists can be lazy. Languages like Haskell do it beautifully. Presumably Perl6 will too.
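The OP's view, written as one stack of list functions for comparison. This is only a sketch, wrapped in a sub so the source handle is explicit; it holds everything in memory at once, which is exactly the trade-off the OP wants to avoid, but it reads like a pipeline bottom-up:

```perl
# Sketch: the foo.tsv view as stacked list functions, read bottom-up:
# lines -> grep first field 42 -> chomp -> pick fields 1,3,8 -> sort.
# All records are in memory at once.
sub view_lines {
    my $in = shift;
    return sort
           map  { join( "\t", ( split /\t/ )[ 1, 3, 8 ] ) . "\n" }
           map  { chomp; $_ }
           grep { /^42\t/ }
           <$in>;
}

# open my $in, '<', 'foo.tsv' or die "$!\n";
# print view_lines($in);
```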
Re: Pipe dream
by Zaxo (Archbishop) on Sep 09, 2005 at 10:09 UTC

    There is pipe. Pipe is lower level than shell pipes, so it is more flexible. It is possible to set up arbitrary networks of processes with pipe, fork, and select. The complexity of such things discourages their use, but that is not a limitation of Perl.

    It is pretty easy to set up a coprocess which you can print stuff to, and read back the result. Skeletal version:

    my ($pin, $pout, $cin, $cout, %kid);
    {
        pipe $pin, $cout;
        pipe $cin, $pout;
        my $cpid = fork;
        die $! unless defined $cpid;
        $kid{$cpid} = 1, last if $cpid;   # parent
        close $pin  or die $!;            # in child to end of block
        close $pout or die $!;
        while (<$cin>) {
            # do filter-like stuff to $_
            print $cout $_;
        }
        exit 0;
    }
    # parent
    close $cin  or die $!;
    close $cout or die $!;
    # print to $pout and read from $pin, maybe in a select loop,
    # depending on the expected behavior of the child.
    delete $kid{wait()} while %kid;
    The complication of that could be wrapped in a module, and probably has been.

    After Compline,
    Zaxo

Re: Pipe dream
by chb (Deacon) on Sep 09, 2005 at 08:25 UTC
    Please have a look at Higher Order Perl by Dominus. Your pipes are called streams there and are implemented with infinite lists. They are not limited to filehandles as data source. Very interesting stuff.

      I'm quite familiar with streams, and with HOP's treatment of them. I think it would be exceedingly awkward to implement the example in my OP using streams. (For one thing, since I am sorting in the last step, I have to fully "actualize" each stream, which means that they become garden-variety (finite) linked lists, each one of them resident in memory.) What I'm talking about has much more to do with iterators, which Dominus also covers extensively in HOP.

      the lowliest monk

        Well, you can't sort infinite streams. I haven't read HOP, but usually streams and iterators are the same thing. Lazy lists, on the other hand, are nothing but memoized iterators. You can sort iterators without storing the entire thing in memory, but it requires multiple passes through the contents of the iterator (N passes for the bubble sort, although you can trade this off for increased memory usage. That is, you can do it in N/2 passes if your bubble sort floats two items to the top at a time).
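The multi-pass idea can be sketched with coderef iterators. Here, bubble_pass is a hypothetical helper that performs one lazy bubble pass; each pass holds only a single element at a time, and chaining N-1 passes yields fully sorted output without ever materializing the whole list in one place:

```perl
# Sketch: one lazy bubble pass over an iterator (a coderef that
# returns undef when exhausted). The pass carries forward the larger
# of each adjacent pair, so the maximum "bubbles" to the end; chaining
# N-1 passes sorts N elements, holding one element per pass.
sub bubble_pass {
    my $it = shift;
    my $held;    # larger element of each adjacent pair, carried forward
    my $done;
    return sub {
        return undef if $done;
        while ( defined( my $cur = $it->() ) ) {
            if ( !defined $held ) { $held = $cur; next; }
            if ( $held le $cur ) {    # held is smaller: emit it
                my $out = $held;
                $held = $cur;
                return $out;
            }
            return $cur;              # cur is smaller: emit it, keep held
        }
        $done = 1;
        return $held;                 # largest element seen this pass
    };
}
```

Using string comparison (le) to match the lexicographic sorting discussed in the thread; swap in a comparator as needed.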
Re: Pipe dream
by ambrus (Abbot) on Sep 09, 2005 at 11:07 UTC

    You do know about open("-|") and open("|-"), don't you?

    However, I don't like to pipe data through real pipes between processes. I prefer real coroutines that pass values to each other, not just text through a pipe. Perl is not really good at this, because it doesn't have continuations or coroutines, and it's very difficult to add them to the perl core now. If you want that, you can use one of:

    • Python, which has coroutines and transparent lazy lists based on them;
    • Perl 6, which will have them too, but it's not ready yet;
    • Scheme, which has continuations built in, and some Scheme implementations and libraries have even more support;
    • Ruby, which has a callback-based iterator model in its core classes, plus continuations and some libraries based on them;
    • Lua, which has coroutines.
Re: Pipe dream
by Roy Johnson (Monsignor) on Sep 09, 2005 at 14:32 UTC
    As you mention, it should be doable with iterators rather than requiring interprocess pipes. Some idle doodling came up with a syntax that appeals to me and might appeal to you.
    # chain_lines takes a list of coderefs.
    # The first sub in the chain is called with no input; subsequent ones are
    # called with $_ set to the line from the previous sub; the iterator
    # returns one line of output from the last sub in the chain. If a sub
    # yields undef, processing restarts at the first sub. When the first sub
    # yields undef, the iterator is done.
    my $i = chain_lines
        sub { <$in> },
        sub { /^42/ ? $_ : undef },                 # grep
        sub { join "\t", (split /\t/)[1,3,8] };

    # all_lines expands the iterator into a list of lines
    sort all_lines($i);
    The grep case might warrant its own special syntax, like
    grep_lines {/^42/}

    Caution: Contents may have been coded under pressure.
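A hedged sketch of chain_lines and all_lines as specified in the comment above; the names and calling convention come from that comment, and the implementation here is one plausible reading of it, not the original doodle:

```perl
# Sketch of chain_lines as specified above: subs after the first see
# the previous sub's output in $_; an undef from any sub restarts at
# the first sub; undef from the first sub ends the iterator.
sub chain_lines {
    my ( $first, @rest ) = @_;
    return sub {
        local $_;
        LINE: while ( defined( $_ = $first->() ) ) {
            for my $s (@rest) {
                my $out = $s->();
                defined $out or next LINE;    # restart at the source
                $_ = $out;
            }
            return $_;
        }
        return undef;                          # source exhausted
    };
}

# all_lines drains the iterator into a list
sub all_lines {
    my $it = shift;
    my ( @lines, $l );
    push @lines, $l while defined( $l = $it->() );
    return @lines;
}
```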

      If you're going to write a grep iterator, have the return value be an empty list on element failure. This allows failures to be removed from the list instead of just being substituted with undefined values. In fact, grep is just a special case of map.

      grep /^42/
      map { /^42/ ? $_ : () }
      sub { /^42/ ? $_ : () }
        undef is a special value in the scheme I've proposed. All results are taken in scalar context (assigned to $_), since this is a line-piping scheme. If the grep step returns undef, the iterator goes back to the first sub to try a new line. undefs do not show up in the output stream.

        You could use an empty list as your no-line-returned indicator, but that would require evaluating the subs in list context and then turning the returned value into a scalar afterward. That would be problematic for some common pipe actions, like reading one line at a time from a file (in list context, the whole file would be read).


        Caution: Contents may have been coded under pressure.
Re: Pipe dream
by ruoso (Curate) on Sep 09, 2005 at 18:29 UTC
    I recently ran into a similar problem. I needed to export all the data from an mdb (MS Access) file, but I don't use Windows and don't have Access. I found the mdbtools software, which knows how to list the tables of a file and how to export a table. So my idea was to run
    mdb-export -d , file.mdb tablename | gzip -9 -c > file.csv.gz
    for each table of each file. But some tables had spaces in their names, which made this hard to do in plain shell, because the shell split the table name into separate words. So I decided to do it in Perl, and ended up with the following code.
    use strict;
    use IPC::Open2;

    opendir DIR, "." or die $!;
    my @bancos = grep { /\.mdb$/ } readdir DIR;
    closedir DIR;

    foreach my $db (@bancos) {
        print "$db\n";
        open TABLES, "mdb-tables -1 $db |" or die $!;
        my @tables = <TABLES>;
        close TABLES;

        my $dbdir = $db;
        $dbdir =~ s/\.mdb$//;
        mkdir $dbdir or die $!;

        foreach my $table (@tables) {
            chomp $table;
            open MDBEXPOUT, "mdb-export -d , '$db' '$table' |" or die $!;
            $table =~ s/\W/_/g;
            open OUTFILE, ">$dbdir/$table.csv.gz" or die $!;
            open2( '>&OUTFILE', '<&MDBEXPOUT', 'gzip', '-9', '-c' ) or die $!;
            wait;
            close MDBEXPOUT;
            close OUTFILE;
            print "$dbdir/$table\n";
        }
    }
    This code uses pipes just as the shell would... I was thinking of creating a module to make it easier to chain pipes like this, but, hmmm... I'm not sure it's a good idea...
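FWIW, the spaces-in-table-names problem goes away entirely with list-form pipe opens (perl 5.8+, not available on Windows), which bypass the shell so no quoting is needed. A sketch, with the helper name read_cmd made up for this example:

```perl
# Sketch: list-form open runs the command directly, without a shell,
# so arguments containing spaces need no quoting (perl 5.8+; the
# helper name read_cmd is invented here).
sub read_cmd {
    my @cmd = @_;
    open my $fh, '-|', @cmd or die "Can't run '@cmd': $!\n";
    return $fh;
}

# e.g. my $fh = read_cmd( 'mdb-export', '-d', ',', $db, $table );
```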
    daniel
