http://qs321.pair.com?node_id=735015

Lexicon has asked for the wisdom of the Perl Monks concerning the following question:

This isn't exactly a perl question, but I'm hoping there's a perl answer.

I've written a perl wrapper around the ProFit (protein structural alignment) program because I want to cross compare a couple hundred PDBs against each other. I open the program with open3 (or open2, actually) at the start of my code, print lots of commands to it, and read the output as needed, right? This works great under OS X, but the same code blocks under linux.

I find that when I quit ProFit in linux, it will spew its buffer, so I'm currently opening/closing the program thousands of times, which is obviously inefficient. If you run ProFit on the terminal, it flushes as it should, responding with data after each command.

So how do I unbuffer the stdout from that program? Is there a better module to use, like IPC::Run or Expect? Perhaps an autoflush-like flag or a terminal call?
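In case it helps, the wrapper boils down to something like this (the handle names and the profit invocation are illustrative, not my actual code):

#!/usr/bin/perl
use strict;
use warnings;
use IPC::Open2;

# Start ProFit once, feed it commands, read answers as we go.
my ($PF_READ, $PF_WRITE);
my $pid = open2($PF_READ, $PF_WRITE, 'profit');

print $PF_WRITE "REFERENCE ref.pdb\n";
print $PF_WRITE "MOBILE mob.pdb\n";
print $PF_WRITE "FIT\n";

# On OS X this comes back right away; on linux it blocks until I QUIT,
# because ProFit's answer is still sitting in its own stdio buffer.
my $line = readline($PF_READ);
print "profit said: $line";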

Thanks a ton guys!

Replies are listed 'Best First'.
Re: open3 buffering in linux vs. os x
by tilly (Archbishop) on Jan 08, 2009 at 22:08 UTC
    You are Suffering from Buffering. You can, as zentara said, make your filehandles unbuffered. That is a good first step, and if it works, great. However, there is a real possibility, and this could well be OS-dependent, that ProFit itself detects whether it is on a terminal and buffers its output when it is not. In that case you will need to provide it with a terminal, which Expect can do for you.
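    A minimal sketch of the Expect route, assuming profit is on your PATH and prints lines like "RMS: 0.123" after a FIT (the command names and the regex are guesses you would adjust to ProFit's real output):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Expect;

    my $exp = Expect->new;
    $exp->raw_pty(1);        # no echo or line-editing on the pty
    $exp->log_stdout(0);     # don't copy the child's output to our STDOUT
    $exp->spawn('profit') or die "Cannot spawn profit: $!";

    $exp->send("REFERENCE ref.pdb\n");
    $exp->send("MOBILE mob.pdb\n");
    $exp->send("FIT\n");

    # Because the child sees a terminal, its stdio should be line-buffered,
    # so this returns as soon as the RMS line appears (or after 10 seconds).
    if ($exp->expect(10, '-re', 'RMS:\s*([\d.]+)')) {
        my ($rms) = $exp->matchlist;
        print "RMSD = $rms\n";
    }
    $exp->send("QUIT\n");
    $exp->soft_close;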
      Yeah, zentara's trick didn't work, so I suspect ProFit is being clever. I'm going to just let the script run for this round but I'll look into Expect when I have more time. Thanks!
Re: open3 buffering in linux vs. os x
by zentara (Archbishop) on Jan 08, 2009 at 21:03 UTC
    You might want to try setting the output pipe filehandle to non-blocking on linux.
    use Fcntl;

    # set non-blocking and unbuffered
    my $flags = fcntl( OUTHANDLE, F_GETFL, 0 );
    fcntl( OUTHANDLE, F_SETFL, $flags | O_NONBLOCK );
    select( ( select(OUTHANDLE), $| = 1 )[0] );

    I'm not really a human, but I play one on earth Remember How Lucky You Are
      Thanks! That was exactly what I was looking for. Alas, it didn't work; it blocks until QUIT like before. I suspect ProFit is just being clever: it has detected that it's not in a terminal and is running in some kind of script mode.
        You might try IO::Pty to fake it out. Here is a simple example, but you might want to google for better ones.
        #!/usr/bin/perl
        use strict;
        use warnings;
        use POSIX ();        # for setsid
        use IO::Handle;
        use IO::Pty;

        sub do_cmd {
            my $pty = IO::Pty->new;
            defined( my $child = fork ) or die "Can't fork: $!";
            if ($child) {
                $pty->close_slave();    # parent keeps only the master side
                return $pty;
            }

            # child: start a new session and make the pty slave
            # its stdout/stderr
            POSIX::setsid();
            my $slave = $pty->slave;
            close($pty);
            STDOUT->fdopen( $slave, '>' )   || die $!;
            STDERR->fdopen( \*STDOUT, '>' ) || die $!;
            system("echo This is stdout output from an external program");
            exit 0;
        }

        my $fh = do_cmd();
        while (<$fh>) { print; }

        I'm not really a human, but I play one on earth Remember How Lucky You Are
Re: open3 buffering in linux vs. os x
by BrowserUk (Patriarch) on Jan 08, 2009 at 23:50 UTC

    Not a *nix user here, so your mileage might vary, but there is a trick I use successfully to cause interactive programs to flush their buffers under Win32 that might work for you.

    Most interactive programs have some kind of 'status' command with a regular and easily recognisable output format. The trick is to issue one or more of these when you've reached a point where you need to know that you're seeing all the output from the previous command or batch of commands.

    A quick look at the ProFit docs shows that it has a 'comment' command: any line entered that starts with '#' is simply echoed to stdout. That is great, because it is easy for your read loop to simply discard any line that starts with '#'.

    With a little investigation, you can even determine the size of the output buffer and simply do print $writeFH '#' x $bufferSize;, and know that whatever output precedes that will have been flushed through after you've issued it.

    It's not exactly a cool solution, but it is easy to implement and try.
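    A rough sketch of the padding approach (the profit invocation, the "RMS:" pattern, and the 4096-byte buffer size are all assumptions; determine the real size as described above):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IPC::Open2;

    my $BUFSIZE = 4096;    # placeholder; measure your actual stdio buffer
    my ($PF_READ, $PF_WRITE);
    my $pid = open2($PF_READ, $PF_WRITE, 'profit');

    print $PF_WRITE "REFERENCE ref.pdb\n";
    print $PF_WRITE "MOBILE mob.pdb\n";
    print $PF_WRITE "FIT\n";

    # Pad past one full buffer so everything before it is pushed through
    # the pipe. ProFit echoes '#' comment lines, so the reader skips them.
    print $PF_WRITE '#' x $BUFSIZE, "\n";

    while (defined(my $line = readline($PF_READ))) {
        next if $line =~ /^\s*#/;          # discard the padding echoes
        if ($line =~ /RMS:\s*([\d.]+)/) {
            print "RMSD = $1\n";
            last;                          # got the value for this FIT
        }
    }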


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Heh, I was actually playing with the comment, status, and some other commands to see if they would flush the buffer automagically, but they didn't work. I didn't go so far as to try to overflow the buffer, though; now that you mention it, that might be an easy trick to pull off.

      Comes back 30 seconds later...

      The buffer is 4KB. Problem solved!

      Interestingly, it only works if I unbuffer the output using fcntl.
Re: open3 buffering in linux vs. os x
by bruno (Friar) on Jan 09, 2009 at 01:30 UTC
    What a small small world this is! Last month I needed to do something similar using ProFit. I didn't see the need to wrap the program with Perl, though, since it's a command-line application that supports scripting (you can pass your ProFit script using the -f flag).

    So, to align an arbitrary number of structures, I created a text file that only said "fit", and then I looped using bash:

    for F in *.pdb; do
        profit -f script.txt complex.1.pdb $F >> output.txt
    done
    Then I used grep to just keep the RMSD value of the run for each structure:
    cat output.txt | grep 'RMS' | awk '{print $2}' > rmsd.txt
    Of course, for anything slightly more complicated than that, I see why you'll want to use Perl.

    If you think it's worth it, consider writing a module under Bio::Tools::Run, so that you can later share your wrapper with everyone. You should look at the Bio::Tools::WrapperBase module and subclass from it; it'll give you a nice interface with all the error checking built in. I could help you do it if you want to!

      I have triplicate simulations of several homologs of a protein. The goal is to compare the structures at certain time ranges across all simulations and record statistics (avg, min, stddev of RMSD) in different subgroups. Oh, and over different zones for alignment. I imagine almost every protein simulation group has written something similar at one point or another, and now it's my turn. :)

      Right now the code is very specific to my situation, but I'm planning to generalize it for my lab after I understand the results a little better. I don't know if it would generalize well past that... over half the code is just parsing through which PDBs I have, giving them titles, organizing timepoints, etc... The part that deals with ProFit is basically one loop and one subroutine.

      I'll be happy to email a copy to you if you're interested, but I don't think you'll find it very enlightening. Here's the core subroutine. $filets are lists of PDBs to be compared. $ranges are the zones to use corresponding to each set. All the hard stuff is the bookkeeping, which will sadly be very user dependent.
      sub blast_filets {
          my ($PF_READ, $PF_WRITE);
          my $pid = open2($PF_READ, $PF_WRITE, $PROFIT)
              or die "Couldn't open pipe to profit. $!";
          my $filets1 = $_[0];
          my $filets2 = $_[1];
          my $range1  = $_[2];
          my $range2  = $_[3] || $range1;
          my @rmsd    = ();
          my $count   = 0;

          if ( @$range1 != @$range2 ) {
              die "Ranges do not have equivalent zone sizes.";
          }

          foreach my $f1 ( @$filets1 ) {
              print "REFERENCE $f1\n" if $VERBOSE >= 3;
              print $PF_WRITE "REFERENCE $f1\n";
              foreach my $f2 ( @$filets2 ) {
                  if ( $VERBOSE >= 3 ) {
                      print "MOBILE $f2\n";
                      print "ZONE CLEAR\n";
                      print "ATOMS CA\n";
                      foreach my $i ( 0..$#$range1 ) {
                          print "ZONE $range1->[$i][0]-$range1->[$i][1]"
                              . ":$range2->[$i][0]-$range2->[$i][1]\n";
                      }
                      print "FIT\n";
                  }
                  print $PF_WRITE "MOBILE $f2\n";
                  print $PF_WRITE "ZONE CLEAR\n";
                  print $PF_WRITE "ATOMS CA\n";
                  foreach my $i ( 0..$#$range1 ) {
                      print $PF_WRITE "ZONE $range1->[$i][0]-$range1->[$i][1]"
                          . ":$range2->[$i][0]-$range2->[$i][1]\n";
                  }
                  print $PF_WRITE "FIT\n";
                  print $PF_WRITE "\n\n\n\n\n\n\n\n\n\n";
              }
          }
          print $PF_WRITE "QUIT\n";

          my $result;
          #print "Reading results\n";
          while ( defined ($result = readline($PF_READ)) ) {
              #print "RESULT = $result\n" if $VERBOSE >= 2;
              if ($result =~ /RMS: ([\d\.]+)/m) {
                  #print "Rmsd = $1\n";
                  push @rmsd, $1;
                  $count++;
              }
              elsif ( $result =~ /Error/i ) {
                  print "Error: $result\n";
              }
          }

          my $result_wait = waitpid($pid, 0);
          if ( $result_wait != $pid ) {
              die "Waitpid returned $result_wait instead of $pid. $?.";
          }
          #print "DONE WITH RMSD\n";

          my ( $rmsd, $stddev ) = Utility::Mean_and_Stddev(@rmsd);
          return ($rmsd, $stddev, $count, \@rmsd);
      }
        Thanks for posting that! Although the ProFit invocation there is not isolated from your case-specific code, it's still very useful to see how the system calls should be done. As for the wrapper, I was thinking of a cleaner, more OO interface, something like:
        my $profitter = Bio::Tools::Run::ProFit->new(
            files     => \@pdbfiles,
            reference => $pdbreference,
        );
        $profitter->fit;
        my %rmsds = $profitter->get_rmsds;
        But then, since I haven't used the program much, I don't know what else it could/should do. If I ever need to use it again, I'll get back to this thread and, with your permission, steal the portions of your code that successfully interact with ProFit.
Re: open3 buffering in linux vs. os x
by ikegami (Patriarch) on Jan 09, 2009 at 06:04 UTC

    Buffering is done at the application level, not at the system level, so you can't control whether another application (ProFit) buffers its output or not, even if it's your child.

    Many applications (including perl) buffer STDOUT when it's not connected to a terminal. So the trick is to convince ProFit that it's connected to a terminal. That's where pseudo-ttys come in.
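    One way to get a pseudo-tty without hand-rolling the fork/exec is IPC::Run's pty redirections; a sketch, with the profit command and the "RMS:" output pattern as assumptions:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use IPC::Run qw(start pump finish);

    my ($in, $out) = ('', '');
    # '<pty<' / '>pty>' attach the child's stdin/stdout to a pseudo-tty,
    # so its C library should line-buffer instead of block-buffer.
    my $h = start ['profit'], '<pty<', \$in, '>pty>', \$out;

    $in .= "REFERENCE ref.pdb\n";
    $in .= "MOBILE mob.pdb\n";
    $in .= "FIT\n";

    # Keep pumping until the RMS line shows up in the accumulated output.
    pump $h until $out =~ /RMS:\s*([\d.]+)/;
    print "RMSD = $1\n";

    $in .= "QUIT\n";
    finish $h;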

      Let me add that some applications accept a command line flag to force interactive mode even when they are not connected to a tty.

      So Lexicon, before going the IO::Pty or Expect way, check in the application manual for that flag!

      BTW, ptys are not reliable on some operating systems, for instance AIX or HP-UX: you can overflow them, and data will be silently dropped.

Re: open3 buffering in linux vs. os x
by zentara (Archbishop) on Jan 09, 2009 at 15:15 UTC
    If filling your 4k buffer on linux is the problem, you may be interested in IPC3 buffer limit problem. On linux, you can detect whether the pipe has anything in it and then sysread it out; see "perldoc -q waiting". You do have to go through the hassle of running "h2ph"; see "perldoc h2ph". Of course, if the problem is that your program doesn't write to the pipe until it has a 4k chunk ready, then this won't work.

    I'm not really a human, but I play one on earth Remember How Lucky You Are
      That solves an entirely different problem (which could be solved better using select). ProFit's data isn't waiting in the pipe; it's waiting in the process's output buffer (a memory block in the C library).
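      For completeness, the select-based check looks roughly like this (the profit command is illustrative); it only reports data that has already reached the pipe, so it still can't force the child to flush its own buffer:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use IPC::Open2;
      use IO::Select;

      my ($PF_READ, $PF_WRITE);
      my $pid = open2($PF_READ, $PF_WRITE, 'profit');

      print $PF_WRITE "FIT\n";

      my $sel = IO::Select->new($PF_READ);
      # Drain whatever is currently sitting in the pipe, waiting up to
      # 250 ms each time before giving up.
      while ($sel->can_read(0.25)) {
          my $n = sysread($PF_READ, my $chunk, 4096);
          last unless $n;                  # EOF or read error
          print $chunk;
      }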