seek() functionality on pipes

HKS has asked for the wisdom of the Perl Monks concerning the following question:

SOLVED: Via the pseek function posted by BrowserUK:

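use constant CHUNK => 4*1024;
sub pseek {
   my( $p, $o ) = @_;
   my $discard;
   # Discard whole chunks until at most one chunk remains.
   read( $p, $discard, CHUNK ), $o -= CHUNK while $o > CHUNK;
   # Discard the remainder so the next read starts at the requested offset.
   read( $p, $discard, $o );
   return $o;
}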

Thanks to all for your help.

-------------------------------------

Short:

Is there a way to achieve seek()-type functionality on pipe output?

Long:

A project I'm working on reads files from a given offset. For bland text files, this is simple:

open(FILE, $path) || die "$!\n";
seek(FILE, $offset, 0);
while(<FILE>) {
    # do stuff
}
close FILE;

However, it also needs to be able to read compressed files. Rather than hardcoding each compression format handler, I'd like to just add a configuration directive that points my program at the appropriate cat tool for the format (zcat, bzcat, whatever happens to be relevant) and open it like this:

open(FILE, '-|', "$cat $file") || die "$!\n";

But as you all know, I can't seek() on a pipe.

How can I work around this? I could dump the output to a file and then read it back in, but this is horribly inefficient and will cause significant performance problems as the files can reach 300-500 MB.

Thanks for any help.


Replies are listed 'Best First'.
Re: seek() functionality on pipes by BrowserUk (Patriarch) on Jul 21, 2008 at 18:06 UTC
So long as you only need to go forward, and always relative to the start of the file, just discard as many bytes as necessary to reach the point you want:

use constant CHUNK => 4*1024;
sub pseek {
   my( $p, $o ) = @_;
   my $discard;
   # Discard whole chunks until at most one chunk remains.
   read( $p, $discard, CHUNK ), $o -= CHUNK while $o > CHUNK;
   # Discard the remainder so the next read starts at the requested offset.
   read( $p, $discard, $o );
   return $o;
}

If you need to do relative or backwards seeks, you're pretty much out of luck unless you can afford to read the whole file into a scalar and then open that scalar as a file:

open MEM, '+<', \$bigscalar or die $!;

In which case you can treat the result just as you would a normal file.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."
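The read-everything-into-a-scalar approach mentioned above might look roughly like the following. This is only a sketch: `open_seekable` is a hypothetical helper name, the perl one-liner stands in for zcat or bzcat, and holding the whole output in memory is only reasonable if the decompressed data fits comfortably in RAM.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp everything the command writes, then reopen the
# resulting scalar as an ordinary, fully seekable in-memory filehandle.
sub open_seekable {
    my @cmd = @_;
    open( my $pipe, '-|', @cmd ) or die "Cannot run @cmd: $!";
    binmode $pipe;
    my $data = do { local $/; <$pipe> };
    close $pipe or die "@cmd failed: " . ( $! || "exit status $?" );
    $data = '' unless defined $data;
    open( my $mem, '<', \$data ) or die "Cannot open in-memory handle: $!";
    return $mem;
}

# Usage: the perl one-liner is a stand-in for the real cat tool.
my $fh = open_seekable( $^X, '-e', 'print "0123456789"' );
seek( $fh, 7, 0 );            # absolute seek to offset 7
read( $fh, my $tail, 3 );     # reads offsets 7..9
seek( $fh, -8, 1 );           # relative seek backwards, from 10 to 2
read( $fh, my $mid, 2 );      # reads offsets 2..3
```

The list form of open is used here so that the file name never passes through a shell.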
Re^2: seek() functionality on pipes by ikegami (Patriarch) on Jul 21, 2008 at 19:08 UTC
`read` is not guaranteed to read the number of bytes you specify, especially when dealing with handles which aren't tied to files. I also added error handling and made the buffer size configurable.

use constant BLK_SIZE => 16*1024;
sub pseek {
   my( $p, $to_read, $blk_size ) = @_;
   $blk_size ||= BLK_SIZE;
   while ( $to_read ) {
      $blk_size = $to_read if $to_read < $blk_size;
      my $read = read( $p, my $discard, $blk_size );
      return $read if !$read;
      $to_read -= $read;
   }
   return 1;
}

Update: Or maybe not. My testing shows that `read` does wait, but its documentation uses the same wording as `sysread`, which does not. As such, I wouldn't count on the observed behaviour.

$ perl -e'$|=1; print "a"; sleep(10); print "b"' | perl -le'read(STDIN, $buf, 10); print $buf'
ab
$ perl -e'$|=1; print "a"; sleep(10); print "b"' | perl -le'sysread(STDIN, $buf, 10); print $buf'
a

Same results on Linux and Windows.
Re^3: seek() functionality on pipes by BrowserUk (Patriarch) on Jul 21, 2008 at 19:22 UTC
Yes. Also, depending upon the OP's requirements, it might be better to use sysread rather than read. Most file format specs are in terms of bytes, not chars. I'm never quite sure whether Perl will start treating input as Unicode without a specific request on an open to do so. For example, does it recognise BOMs in an input stream and act upon them?
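A sysread-based skip along the lines suggested here could be sketched as follows. The name `sysskip` and the producer one-liner are made up for illustration. Because sysread may return fewer bytes than requested, the loop counts what it actually got, and the handle is put into binmode so that offsets are in bytes. Buffered calls such as `read` or `<$fh>` should not be mixed with sysread on the same handle.

```perl
use strict;
use warnings;

# Hypothetical variant of pseek built on sysread. Returns the number of
# bytes actually skipped, which is less than requested if EOF comes first.
sub sysskip {
    my( $fh, $to_skip, $blk_size ) = @_;
    $blk_size ||= 16 * 1024;
    my $skipped = 0;
    while ( $skipped < $to_skip ) {
        my $want = $to_skip - $skipped;
        $want = $blk_size if $want > $blk_size;
        my $got = sysread( $fh, my $discard, $want );
        die "sysread failed: $!" unless defined $got;
        last if $got == 0;              # EOF before the target offset
        $skipped += $got;               # short reads are normal on pipes
    }
    return $skipped;
}

# Usage: the perl one-liner stands in for zcat, bzcat, etc.
open( my $pipe, '-|', $^X, '-e', 'print "a" x 10_000, "MARK"' )
    or die "Cannot start producer: $!";
binmode $pipe;
my $skipped = sysskip( $pipe, 10_000, 4096 );
sysread( $pipe, my $next, 4 );          # first bytes after the skipped region
close $pipe;
```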
Re^4: seek() functionality on pipes by ikegami (Patriarch) on Jul 21, 2008 at 19:34 UTC
Re^5: seek() functionality on pipes by BrowserUk (Patriarch) on Jul 21, 2008 at 19:46 UTC
Some notes below your chosen depth have not been shown here
Re^2: seek() functionality on pipes by HKS (Acolyte) on Jul 21, 2008 at 18:58 UTC
The pseek() function is pretty much what I was looking for - thanks. The performance isn't great, but it allows me the flexibility to use whatever compression format I like without having to decompress to a file, read the new file in, and then remove it. Thanks for the help.
Re^3: seek() functionality on pipes by BrowserUk (Patriarch) on Jul 21, 2008 at 19:32 UTC
See ikegami's improvements above. Also see my comments about using sysread rather than read, which still seems to give a substantial performance improvement on my system at least.

I don't think there is much that can be done about the performance. Increasing the read chunk size probably won't benefit much, as you are going to be limited by whatever buffers the system allocates to the pipe; that seems to be about 4k on my system.

One thing that may improve it, even though it is counterintuitive, is to insert a brief sleep after each read in the loop, especially if the read did not return a full buffer. If the producing process is slightly slow, then attempting to read again too quickly is pointless, as there may be nothing, or less than a full buffer load, available to read, and you could end up reading a few bytes each time with a task switch required in between to permit the producer to produce some more. Adding a short sleep when a read fails to fill the buffer (even a `sleep 0;` may be enough) could improve throughput markedly. Something to experiment with on the target system and producer program.
Re: seek() functionality on pipes by RMGir (Prior) on Jul 21, 2008 at 18:07 UTC
It's not seek for pipes in general, but if all you want is seek for gunzip, check out the documentation for IO::Uncompress::AnyInflate, which is part of IO-Compress-Zlib. At a quick search.cpan.org glance, it seems to do what you want, although I'm not sure how efficient seeking is compared to just reading and throwing away n bytes. It's very unlikely to be worse, and might just be a lot better if there's a way to skip some uncompressing by reading metadata, so it's worth a try... Mike
Re^2: seek() functionality on pipes by HKS (Acolyte) on Jul 21, 2008 at 19:02 UTC
I chose not to go with the Perl decompression libraries because I want to allow for multiple formats (bzip2, gzip, zip, etc) without a whole lot of extra code.
Re^3: seek() functionality on pipes by RMGir (Prior) on Jul 22, 2008 at 11:23 UTC
Reasonable. Although note that IO::Uncompress::AnyInflate supports zip and gzip, so you might be able to use that for those formats if it seeks faster, and fall back to a pipe/pseek solution for bzip2. Mike
Re: seek() functionality on pipes by Fletch (Bishop) on Jul 21, 2008 at 17:54 UTC
That's pretty much the only option if you need the entire file. If you only need to be able to seek back within a smaller window, you can implement your own buffered reads and move your "filepointer" back and forth within your buffer (I believe this is how `less` implements being able to page back on piped input), but as you point out, both approaches have overhead issues. The cake is a lie. The cake is a lie. The cake is a lie.
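The windowed-buffer idea could be sketched like this. `WindowedReader`, `read_bytes` and `seek_to` are hypothetical names, and this is not a description of how `less` actually works. The reader keeps the last N bytes it has handed out, so backward seeks succeed only while the target offset is still inside that window.

```perl
use strict;
use warnings;

package WindowedReader;   # hypothetical class, for illustration only

sub new {
    my( $class, $fh, $window ) = @_;
    # buf holds the bytes starting at absolute offset base; pos is absolute.
    return bless { fh => $fh, window => $window, buf => '', base => 0, pos => 0 }, $class;
}

sub read_bytes {
    my( $self, $len ) = @_;
    my $end = $self->{base} + length $self->{buf};
    while ( $end < $self->{pos} + $len ) {          # need more data from the pipe
        my $got = read( $self->{fh}, $self->{buf},
                        $self->{pos} + $len - $end, length $self->{buf} );
        die "read failed: $!" unless defined $got;
        last if $got == 0;                          # EOF
        $end += $got;
    }
    my $out = substr( $self->{buf}, $self->{pos} - $self->{base}, $len );
    $self->{pos} += length $out;
    # Keep at most `window` bytes behind the current position.
    my $excess = $self->{pos} - $self->{base} - $self->{window};
    if ( $excess > 0 ) {
        substr( $self->{buf}, 0, $excess, '' );
        $self->{base} += $excess;
    }
    return $out;
}

sub seek_to {
    my( $self, $target ) = @_;
    die "offset $target is no longer buffered\n" if $target < $self->{base};
    my $end = $self->{base} + length $self->{buf};
    if ( $target <= $end ) { $self->{pos} = $target; return $target }
    $self->{pos} = $end;                            # forward: read and discard
    while ( $self->{pos} < $target ) {
        my $want = $target - $self->{pos};
        $want = 4096 if $want > 4096;
        last if length( $self->read_bytes($want) ) == 0;
    }
    return $self->{pos};
}

package main;

# Usage: the perl one-liner stands in for the real cat tool.
open( my $pipe, '-|', $^X, '-e', 'print "0123456789ABCDEF"' )
    or die "Cannot start producer: $!";
my $r = WindowedReader->new( $pipe, 8 );
my $head = $r->read_bytes(10);    # offsets 0..9; only 2..9 stay buffered
$r->seek_to(4);                   # backwards, still inside the window
my $again = $r->read_bytes(3);    # offsets 4..6
my $too_far = eval { $r->seek_to(0); 1 } ? 0 : 1;   # offset 0 was dropped
```

The window size trades memory for how far back a seek can reach; anything older than the window would need the restart approach or a temporary file.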
Re: seek() functionality on pipes by zentara (Archbishop) on Jul 21, 2008 at 18:19 UTC
On Linux, you can see how much data is in the pipe. You might be able to sysread the first few bytes to get the compression header, save those bytes, then redirect them (and the rest of the sysread chunks) to the appropriate decompressor pipe's filehandle. See IPC3 buffer limit problem and look at perldoc -q 'character waiting'. I'm not really a human, but I play one on earth Remember How Lucky You Are
Re^2: seek() functionality on pipes by HKS (Acolyte) on Jul 21, 2008 at 19:03 UTC
Thanks for the pointer. I won't be able to use that in this particular project due to portability issues, but it'll be handy sometime in the future.
Re: seek() functionality on pipes by salva (Canon) on Jul 21, 2008 at 18:07 UTC
As long as you only need to seek forward, you can use IO::Uncompress::Gunzip, which supports the (forward only) seek method.
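A minimal sketch of the forward-only seek in IO::Uncompress::Gunzip follows. The gzip data is built in memory purely to keep the example self-contained; in practice the constructor would be given a file name or filehandle. Offsets refer to the uncompressed data, and seeking backwards is a fatal error.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Build a small gzip stream in memory (sample data, for illustration only).
my $plain = join '', map { "line $_\n" } 1 .. 100;
gzip( \$plain => \my $compressed ) or die "gzip failed: $GzipError";

my $z = IO::Uncompress::Gunzip->new( \$compressed )
    or die "gunzip failed: $GunzipError";

# Skip the first nine lines; each of "line 1\n" .. "line 9\n" is 7 bytes.
my $offset = length("line 1\n") * 9;
$z->seek( $offset, 0 );          # whence 0 = from the start of the data
my $line = $z->getline;          # first line after the skipped region
$z->close;
```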
Re: seek() functionality on pipes by sgifford (Prior) on Jul 22, 2008 at 03:07 UTC
Net::FTP::RetrHandle on CPAN has some code that might be useful. It emulates a seekable filehandle from an FTP server by a combination of skipping over bytes, doing partial transfers, and restarting the transfer when necessary. You could do something similar: skip bytes to seek forward, and to seek backwards start over and then seek forward. -- sgifford's Web page
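The start-over-and-skip idea could be sketched as below. `RestartablePipe` and its methods are hypothetical names and are not taken from Net::FTP::RetrHandle. A backward seek simply reruns the command and discards bytes up to the target, so it costs a full decompression from the start each time.

```perl
use strict;
use warnings;

package RestartablePipe;   # hypothetical class, for illustration only

sub new {
    my( $class, @cmd ) = @_;
    my $self = bless { cmd => [@cmd], pos => 0 }, $class;
    $self->_start;
    return $self;
}

# (Re)run the command and reset the position to the start of its output.
sub _start {
    my $self = shift;
    close $self->{fh} if $self->{fh};
    open( my $fh, '-|', @{ $self->{cmd} } )
        or die "Cannot run @{ $self->{cmd} }: $!";
    binmode $fh;
    @$self{qw(fh pos)} = ( $fh, 0 );
}

sub read_bytes {
    my( $self, $len ) = @_;
    my $buf = '';
    while ( length($buf) < $len ) {
        my $got = read( $self->{fh}, $buf, $len - length($buf), length($buf) );
        die "read failed: $!" unless defined $got;
        last if $got == 0;                       # EOF
    }
    $self->{pos} += length $buf;
    return $buf;
}

sub seek_to {
    my( $self, $target ) = @_;
    $self->_start if $target < $self->{pos};     # backwards: start over
    while ( $self->{pos} < $target ) {           # forwards: read and discard
        my $want = $target - $self->{pos};
        $want = 4096 if $want > 4096;
        last if length( $self->read_bytes($want) ) == 0;
    }
    return $self->{pos};
}

package main;

# Usage: the perl one-liner stands in for zcat, bzcat, etc.
my $p = RestartablePipe->new( $^X, '-e', 'print "abcdefghij"' );
$p->seek_to(6);
my $first = $p->read_bytes(2);    # offsets 6..7
$p->seek_to(1);                   # backwards, so the command is rerun
my $second = $p->read_bytes(3);   # offsets 1..3
```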

