PerlMonks  

seek() functionality on pipes

by HKS (Acolyte)
on Jul 21, 2008 at 17:40 UTC ( [id://699086]=perlquestion )

HKS has asked for the wisdom of the Perl Monks concerning the following question:

SOLVED: Via the pseek function posted by BrowserUK:

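use constant CHUNK => 4*1024;

sub pseek {
    my( $p, $o ) = @_;
    # Discard whole chunks while more than one chunk remains.
    read( $p, my $discard, CHUNK ), $o -= CHUNK while $o > CHUNK;
    # Discard the remaining $o bytes so the next read starts at the offset.
    read( $p, $discard, $o );
    return $o;
}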

Thanks to all for your help.

-------------------------------------

Short:

Is there a way to achieve seek()-type functionality on pipe output?

Long:

A project I'm working on reads files from a given offset. For bland text files, this is simple:

open(FILE, $path) || die "$!\n";
seek(FILE, $offset, 0);
while(<FILE>) {
    # do stuff
}
close FILE;

However, it also needs to be able to read compressed files. Rather than hardcoding each compression format handler, I'd like to just add a configuration directive that points my program at the appropriate cat tool for the format (zcat, bzcat, whatever happens to be relevant) and open it like this:

open(FILE, '-|', "$cat $file") || die "$!\n";

But as you all know, I can't seek() on a pipe.

How can I work around this? I could dump the output to a file and then read it back in, but this is horribly inefficient and will cause significant performance problems as the files can reach 300-500 MB.

Thanks for any help.

Replies are listed 'Best First'.
Re: seek() functionality on pipes
by BrowserUk (Patriarch) on Jul 21, 2008 at 18:06 UTC

    So long as you only need to go forward and always relative to the start of the file, then just discard as many bytes as necessary to reach the point you want:

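    use constant CHUNK => 4*1024;

    sub pseek {
        my( $p, $o ) = @_;
        # Discard whole chunks while more than one chunk remains.
        read( $p, my $discard, CHUNK ), $o -= CHUNK while $o > CHUNK;
        # Discard the remaining $o bytes so the next read starts at the offset.
        read( $p, $discard, $o );
        return $o;
    }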

    If you need to do relative or backwards seeks, you're pretty much out of luck unless you can afford to read the whole file into a scalar and then open that scalar as a file:

    open MEM, '+<', \$bigscalar or die $!;

    In which case you can treat the result just as you would a normal file.
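The read-everything-then-open-a-scalar approach described above could look roughly like the following. This is only a sketch: `slurp_to_seekable` is a made-up helper name, and the child command here is a stand-in Perl one-liner rather than a real `zcat`. Note that the whole output sits in memory, which matters for the 300-500 MB files mentioned in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: slurp everything a command writes to STDOUT and
# return a seekable in-memory filehandle over that data.
sub slurp_to_seekable {
    my @cmd = @_;
    open( my $pipe, '-|', @cmd ) or die "Cannot run @cmd: $!";
    binmode $pipe;
    my $data = do { local $/; <$pipe> };    # read to EOF in one go
    close $pipe or die "@cmd failed: $! (status $?)";
    $data = '' unless defined $data;
    open( my $mem, '<', \$data ) or die "Cannot open in-memory handle: $!";
    return $mem;                            # the handle keeps $data alive
}

# Stand-in producer: 30 bytes, "0123456789" three times.
my $fh = slurp_to_seekable( $^X, '-e', 'print "0123456789" x 3' );

seek( $fh, 25, 0 ) or die "seek failed: $!";     # absolute seek
read( $fh, my $tail, 5 );
seek( $fh, -10, 1 ) or die "seek failed: $!";    # relative, backwards
read( $fh, my $mid, 3 );

print "tail=$tail mid=$mid\n";
```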


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      read is not guaranteed to read the number of bytes you specify, especially when dealing with handles which aren't tied to files. I also added error handling and made the buffer size configurable.

      use constant BLK_SIZE => 16*1024;

      sub pseek {
          my( $p, $to_read, $blk_size ) = @_;
          $blk_size ||= BLK_SIZE;
          while ( $to_read ) {
              $blk_size = $to_read if $to_read < $blk_size;
              my $read = read( $p, my $discard, $blk_size );
              return $read if !$read;
              $to_read -= $read;
          }
          return 1;
      }

      Update: Or maybe not. My testing shows that read does wait, but its documentation uses the same wording as sysread, which does not. As such, I wouldn't count on the observed behaviour.

      $ perl -e'$|=1; print "a"; sleep(10); print "b"' | perl -le'read(STDIN, $buf, 10); print $buf'
      ab
      $ perl -e'$|=1; print "a"; sleep(10); print "b"' | perl -le'sysread(STDIN, $buf, 10); print $buf'
      a

      Same results on linux and Windows.

        Yes. Also, depending upon the OP's requirements, it might be better to use sysread rather than read. Most file format specs are in terms of bytes, not chars.

        I'm never quite sure whether Perl will start treating input as unicode without a specific request on an open to do so? For example, does it recognise BOMs in an input stream and act upon them?


      The pseek() function is pretty much what I was looking for - thanks. The performance isn't great, but it allows me the flexibility to use whatever compression format I like without having to decompress to a file, read the new file in, and then remove it. Thanks for the help.

        See ikegami's improvements above. Also my comments about using sysread rather than read which still seems to give a substantial performance improvement on my system at least.

        I don't think there is much that can be done about the performance. Increasing the read chunk size probably won't benefit much as you are going to be limited by whatever buffers the system allocates to the pipe--seems to be about 4k on my system.

        One thing that may improve it, even though it is counterintuitive, is to insert a brief sleep after each read in the loop, especially if the read did not return a full buffer.

        If the producing process is slightly slow, then attempting to read again too quickly is pointless, as there may be nothing, or less than a full buffer load available to read, and you could end up reading a few bytes each time with a task switch required in between to permit the producer to produce some more.

        Adding a short sleep whenever a read fails to fill the buffer (even a sleep 0; may be enough) could improve throughput markedly. Something to experiment with on the target system and producer program.
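The two suggestions above (sysread instead of read, and a brief pause after a short read) might be combined along these lines. Treat it as a sketch to experiment with, not a tuned implementation: the name `pskip`, the `$pause` parameter, and the 0.01 second value are all assumptions, and the producer is a stand-in one-liner.

```perl
use strict;
use warnings;

use constant BLK_SIZE => 16 * 1024;

# Hypothetical helper: discard $to_skip bytes from $fh using sysread.
# If $pause (seconds, may be fractional) is set, wait that long after
# any read that returned less than was asked for.
# Returns the number of bytes actually skipped.
sub pskip {
    my ( $fh, $to_skip, $pause ) = @_;
    my $skipped = 0;
    while ( $skipped < $to_skip ) {
        my $want = $to_skip - $skipped;
        $want = BLK_SIZE if $want > BLK_SIZE;
        my $got = sysread( $fh, my $discard, $want );
        die "sysread failed: $!" unless defined $got;
        last if $got == 0;    # EOF before the target offset
        $skipped += $got;
        # Short read: give the producer a moment to refill the pipe.
        select( undef, undef, undef, $pause ) if $pause && $got < $want;
    }
    return $skipped;
}

# Stand-in producer: 50000 filler bytes followed by "TAIL".
open( my $pipe, '-|', $^X, '-e', 'print "x" x 50000, "TAIL"' )
    or die "Cannot start producer: $!";
binmode $pipe;    # work in bytes, not characters

my $skipped = pskip( $pipe, 50_000, 0.01 );

# Keep using sysread afterwards; mixing it with buffered read is unsafe.
my $rest = '';
while ( sysread( $pipe, my $buf, 4096 ) ) { $rest .= $buf }
close $pipe;

print "skipped $skipped bytes, rest is '$rest'\n";
```

Whether the pause helps at all depends on the producer and the system's pipe buffering, so it is worth timing both with and without it.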


Re: seek() functionality on pipes
by RMGir (Prior) on Jul 21, 2008 at 18:07 UTC
    It's not seek for pipes in general, but if all you want is seek for gunzip, check out the documentation for IO::Uncompress::AnyInflate, which is part of IO-Compress-Zlib.

    At a quick search.cpan.org glance, it seems to do what you want, although I'm not sure how efficient seeking is as compared to just reading and throwing away n bytes. It's very unlikely to be worse, and might just be a lot better if there's a way to skip some uncompressing by reading metadata, so it's worth a try...


    Mike
      I chose not to go with the Perl decompression libraries because I want to allow for multiple formats (bzip2, gzip, zip, etc) without a whole lot of extra code.
        Reasonable. Although note that IO::Uncompress::AnyInflate supports zip and gzip, so you might be able to use that for those formats if it seeks faster, and fall back to a pipe/pseek solution for bzip2.

        Mike
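For the gzip/zip case, a forward seek with IO::Uncompress::AnyInflate might look like the sketch below. The sample data is built in memory with IO::Compress::Gzip purely so the example is self-contained; a real program would pass a filename instead. The offset is measured in uncompressed bytes, and as far as I can tell the module still has to decompress the skipped data, so this may not be faster than a pipe plus pseek.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::AnyInflate qw($AnyInflateError);

# Build a small gzip stream in memory: 100 lines of 9 bytes each.
my $plain = join '', map { sprintf "line %03d\n", $_ } 1 .. 100;
my $gz;
gzip( \$plain => \$gz ) or die "gzip failed: $GzipError";

my $z = IO::Uncompress::AnyInflate->new( \$gz )
    or die "AnyInflate failed: $AnyInflateError";

# Forward-only seek: skip the first 50 lines (50 * 9 uncompressed bytes).
$z->seek( 50 * 9, 0 ) or die "seek failed";
my $line = $z->getline;
$z->close;

print $line;
```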
Re: seek() functionality on pipes
by Fletch (Bishop) on Jul 21, 2008 at 17:54 UTC

    That's pretty much the only option if you need the entire file. If you only need to be able to seek back within a smaller window you can implement your own buffering reads and move your "filepointer" within your buffer back and forth (I believe this is how less implements being able to page back on piped input), but as you point out both approaches have overhead issues.
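The buffered-window idea above could be sketched as a small wrapper that remembers the last few bytes read. `WindowReader`, its methods, and the 8-byte window are illustrative names and values only, and the producer is a stand-in one-liner.

```perl
use strict;
use warnings;

# Hypothetical WindowReader: keeps the most recent {window} bytes read
# from a non-seekable handle so the caller can step back within them.
package WindowReader;

sub new {
    my ( $class, $fh, $window ) = @_;
    # start = stream offset of the first byte in buf; pos = current offset
    return bless { fh => $fh, window => $window, buf => '', start => 0, pos => 0 }, $class;
}

# Return up to $len bytes starting at the current position.
sub read {
    my ( $self, $len ) = @_;
    my $end = $self->{start} + length $self->{buf};
    # Pull more data from the handle if the request runs past the buffer.
    while ( $self->{pos} + $len > $end ) {
        my $got = CORE::read( $self->{fh}, my $chunk, $self->{pos} + $len - $end );
        last unless $got;    # EOF or error
        $self->{buf} .= $chunk;
        $end += $got;
    }
    my $out = substr( $self->{buf}, $self->{pos} - $self->{start}, $len );
    $self->{pos} += length $out;
    # Trim so that at most {window} bytes are kept behind the position.
    my $excess = ( $self->{pos} - $self->{start} ) - $self->{window};
    if ( $excess > 0 ) {
        substr( $self->{buf}, 0, $excess, '' );
        $self->{start} += $excess;
    }
    return $out;
}

# Step back $n bytes; returns false if that would leave the window.
sub rewind {
    my ( $self, $n ) = @_;
    return 0 if $self->{pos} - $n < $self->{start};
    $self->{pos} -= $n;
    return 1;
}

package main;

open( my $pipe, '-|', $^X, '-e', 'print join "", "a" .. "z"' )
    or die "Cannot start producer: $!";
my $r = WindowReader->new( $pipe, 8 );

my $first   = $r->read(10);
my $ok      = $r->rewind(4);     # within the 8-byte window
my $again   = $r->read(6);
my $too_far = $r->rewind(20);    # outside the window

print "first=$first again=$again\n";
```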

    The cake is a lie.

Re: seek() functionality on pipes
by zentara (Archbishop) on Jul 21, 2008 at 18:19 UTC
    On Linux, you can see how much data is in the pipe. You might be able to sysread the first few bytes to get the compression header, save those bytes, then redirect them (and the rest of the sysread chunks) to the appropriate decompressor pipe's filehandle. See IPC3 buffer limit problem and look at perldoc -q 'character waiting'.

    I'm not really a human, but I play one on earth Remember How Lucky You Are
      Thanks for the pointer. I won't be able to use that in this particular project due to portability issues, but it'll be handy sometime in the future.
Re: seek() functionality on pipes
by salva (Canon) on Jul 21, 2008 at 18:07 UTC
    As long as you only need to seek forward, you can use IO::Uncompress::Gunzip, which supports a (forward-only) seek method.
Re: seek() functionality on pipes
by sgifford (Prior) on Jul 22, 2008 at 03:07 UTC
    Net::FTP::RetrHandle on CPAN has some code that might be useful. It emulates a seekable filehandle from an FTP server by a combination of skipping over bytes, doing partial transfers, and restarting the transfer when necessary. You could do something similar: skip bytes to seek forward, and to seek backwards start over and then seek forward.
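Applied to a decompressor pipe, the skip-forward / restart-to-go-back idea might be sketched as follows. This is not how Net::FTP::RetrHandle is written; `RestartablePipe` and its methods are invented for illustration, and the command is a stand-in one-liner where a real program would run the configured cat tool.

```perl
use strict;
use warnings;

# Hypothetical RestartablePipe: emulates absolute seeks on a command's
# output by skipping forward, or by rerunning the command to go back.
package RestartablePipe;

sub new {
    my ( $class, @cmd ) = @_;
    my $self = bless { cmd => [@cmd], pos => 0 }, $class;
    $self->_start;
    return $self;
}

# (Re)run the command and reset the position to the start of its output.
sub _start {
    my ($self) = @_;
    close $self->{fh} if $self->{fh};
    open( $self->{fh}, '-|', @{ $self->{cmd} } )
        or die "Cannot run @{ $self->{cmd} }: $!";
    binmode $self->{fh};
    $self->{pos} = 0;
}

# Read up to $len bytes and track the stream position.
sub read {
    my ( $self, $len ) = @_;
    my $got = CORE::read( $self->{fh}, my $buf, $len );
    die "read failed: $!" unless defined $got;
    $self->{pos} += $got;
    return $buf;
}

# Seek to an absolute offset; returns true if the offset was reached.
sub seek_to {
    my ( $self, $target ) = @_;
    $self->_start if $target < $self->{pos};    # backwards: start over
    while ( $self->{pos} < $target ) {
        my $want = $target - $self->{pos};
        $want = 16 * 1024 if $want > 16 * 1024;
        last unless length $self->read($want);  # EOF before target
    }
    return $self->{pos} == $target;
}

package main;

my $p = RestartablePipe->new( $^X, '-e', 'print join "", "a" .. "z"' );

$p->seek_to(20) or die "could not reach offset 20";
my $x = $p->read(3);
$p->seek_to(2) or die "could not reach offset 2";    # reruns the command
my $y = $p->read(3);

print "x=$x y=$y\n";
```

A backwards seek costs a full rerun plus a skip, so this only pays off when such seeks are rare.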
