
Re: Uncompress streaming gzip on the fly in LWP::UserAgent/WWW::Mechanize

by bliako (Monsignor)
on Dec 19, 2018 at 01:53 UTC ( [id://1227437] )


in reply to Solved: Uncompress streaming gzip on the fly in LWP::UserAgent/WWW::Mechanize

Could it be that the compressed stream you are receiving is a sequence of self-contained compressed chunks of data? If so, IO::Uncompress::Gunzip can detect the end of a compressed chunk and reset; see "An advanced tip" in https://www.perl.com/article/162/2015/3/27/Gzipping-data-directly-from-Perl/ (by brian d foy) regarding MultiStream.
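For example, a minimal (untested) sketch of the one-shot interface with MultiStream, where $buffer and $out are just placeholder names for a buffer holding one or more complete gzip streams and the uncompressed result:

    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

    my $out;
    # uncompress a buffer that may hold several concatenated gzip streams
    gunzip \$buffer => \$out, MultiStream => 1
        or die "gunzip failed: $GunzipError\n";
    print $out;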

On a side note, do infinite streams of ([g]zip) compressed data exist?

bw, bliako

Replies are listed 'Best First'.
Re^2: Uncompress streaming gzip on the fly in LWP::UserAgent/WWW::Mechanize
by kschwab (Vicar) on Dec 19, 2018 at 22:17 UTC
    "On a side note, do infinite streams of (gzip) compressed data exist?"
    If you need one:
    $ gzip -f - </dev/urandom

      Thanks! Second question: does that unzip (before the end of time)?

        warning, makes "lotsafiles"
        $ mkdir lotsafiles
        $ cd lotsafiles
        $ gzip -f - </dev/urandom | gunzip | split -b 1024 &
        $ ls
Re^2: Uncompress streaming gzip on the fly in LWP::UserAgent/WWW::Mechanize
by Your Mother (Archbishop) on Dec 19, 2018 at 13:28 UTC

    Thanks for looking. It's not self-contained chunks, at least not immediately. I can gunzip the first chunk in isolation fine, and then the next is gibberish. But if I concatenate the first and second, they gunzip fine. Probably there is some point, some bigger chunk, where it starts over as you suggest; IIRC there was a mention of 32kB somewhere. I tried the MultiStream setting, and other options, in my many experiments. I was definitely doing something wrong, though. I'll dig back in.

      I would investigate what that "gibberish" is, and whether Gunzip fails on that data or whether it does uncompress it and what you get back is "gibberish". If gunzip does not fail, then it is possible that you sometimes have zip-inside-zip. Something like the check below would distinguish the two cases.
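      A rough, untested sketch (where $chunk is assumed to hold one received chunk):

          use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

          my $out;
          if ( gunzip \$chunk => \$out ) {
              # gunzip itself succeeded; if $out still looks like binary noise,
              # the payload may be compressed a second time (zip-inside-zip)
              print length($out), " bytes uncompressed\n";
          }
          else {
              # gunzip really failed, e.g. the chunk is not a complete gzip stream
              warn "gunzip failed: $GunzipError\n";
          }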

      So, they have a logical chunk of data, based on the XML I saw in their page (<quote>...</quote>), and then a fixed chunk of compressed data of 32kB? Isn't that weird? I mean, they compress 5 chunks of data and sometimes it comes to 32kB and sometimes to 33kB, depending on the content. How can they always send 32kB and expect the recipient to get exactly 5 chunks of data? Unless they sometimes send 4 chunks, sometimes 5, and most of the time something fractional in between. And if they do send something fractional, isn't it weird to make you waste time waiting for the remaining half chunk to appear (whenever the 32kB limit of the next chunk is filled)? You get something like "IBM up 2<end of chunk, sorry>" and then you wait a few valuable seconds for the next chunk to find out whether it is up 2000 points or 2.4 points!

      Of course they could also pad, but what would be the point of all that computational burden on their side, and of forcing the client to wait until 32kB of compressed data have accumulated before knowing where the market is going?

      Just thinking out loud...

        Thanks again for thinking about it at all, out loud or otherwise. :P

        The 32kB is just something I saw somewhere about gzip streams. I don't remember where, I probably shouldn't have mentioned it.

        If I do this (assume proper var scoping)–

        gunzip \$data => \$out;
        print $out, $/;

        –it will display something like–

        <status>connected</status> ?R??0 ????l??????@? +U?&#1964;??/?%y???p???v?Po#[???-???x? >\'&#1000;??4'?V.6?6?&#1444;~5Y???0???C]?$?@m~OgQ?u&#451;8?Y?E?8<?Le?4 +?6??&#1644;&qd?x#1

        Amended to–

        $collected .= $data;
        gunzip \$collected, \$out;
        print $out, $/;

        We get (it's ignoring the Accept header and returning XML)–

        <status>connected</status>
        <quote>
          <ask>166.29</ask>
          <asksz>500</asksz>
          <bid>166.26</bid>
          …
        </quote>
        ...

        And then it dies after a while; it's inconsistent where, but never sooner than 5kB in, with an "unexpected end" style message.

        Adding this lets it run, maybe forever (I didn't let it run that long), but it's still stacking up an ever-growing scalar and gunzipping the same data over and over–

        $collected .= $data;
        gunzip \$collected, \$out, MultiStream => 1;
        print $out, $/;

        I expect I will have to come up with some seek/tell/truncate kind of solution that uses MultiStream to reset itself automatically, to keep the data from growing forever. Haven't had time to go back to it. I feel like this must be a solved problem and I'm just looking in the wrong place. :|
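        What I have in mind is something like the rough, untested sketch below, though it goes a different route than MultiStream: Compress::Raw::Zlib's incremental inflate eats each network chunk as it arrives, so nothing has to accumulate (all names here are made up for the sketch)–

        use strict;
        use warnings;
        use Compress::Raw::Zlib;   # exports Z_OK, Z_STREAM_END, WANT_GZIP

        my $inflater = new_inflater();

        sub new_inflater {
            my ( $i, $status ) = Compress::Raw::Zlib::Inflate->new(
                -WindowBits   => WANT_GZIP,   # expect gzip-wrapped data
                -ConsumeInput => 1,           # eat input as it is decoded
            );
            die "cannot create inflater: $status\n" unless $i;
            return $i;
        }

        # called with each chunk handed to the response callback
        sub handle_chunk {
            my ($data) = @_;
            while ( length $data ) {
                my $status = $inflater->inflate( $data, my $out );
                print $out if defined $out && length $out;
                if ( $status == Z_STREAM_END ) {
                    # one gzip stream finished; fresh inflater for whatever follows
                    $inflater = new_inflater();
                    next;   # $data may already hold the start of the next stream
                }
                die "inflate error: ", $inflater->msg, " ($status)\n"
                    if $status != Z_OK;
                last;       # Z_OK: input consumed, wait for the next chunk
            }
        }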
