Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re: More details on compressing a set of integers

by BrowserUk (Patriarch)
on Jan 26, 2003 at 06:48 UTC ( [id://229950]=note: print w/replies, xml ) Need Help??


in reply to More details on compressing a set of integers
in thread Compressing a set of integers

Its worth pointing out that attempting to compress a string containing a list of comma delimited ascii numbers using Base64, will make the string grow in size, not shrink.

Eg. A string containing 1000 5 digit numbers separated by commas, is 6000 bytes. Encode that base64 and you get a string nearly 8K in length.

Pack the same string using 16-bits and you get 2000 bytes.

If you really need to make it smaller still, then you might consider using Compress::Zlib::memGzip which compresses the same ascii list into 1800 bytes.

There are caveats associated with this 10% reduction in size. A) your compression will vary with the data set. B) Compress::Zlib::memGunzip is 10 to 15 times slower than unpack.

If your main concern is that using pack & unpack with format 'S*', will limit you to 16-bit numbers, then move to 32-bit using 'L*'. This will pack the same 1000 numbers into 4000 bytes which makes it look not such a good option until you think about what happens when you start using 6-digit numbers.

The string containing 100,000 .. 101,000, comma delimited ascii is 7000 bytes. Base64:9467, zipped:1817, packed 32-bit 4000. Again, Zlib is looking good, but now the figures for a random mixed of 1 to 6-digit numbers

ascii:6458, base64:8726, zipped:3169, pack 32-bit:4000.

Suddenly the trade for decompression speed looks less valuable, and if you move to a random mix of 8-digit numbers, Zlib starts averaging 4500 bytes for 1000 numbers, pack remains at 4000 bytes.

Finally, with Zlib, there is a heavy penalty for small lists. Lists of 1 x 5-digits average around 25 bytes; 4 x 5-digits numbers around 45 bytes and 4 x 8-digit numbers around 75 bytes. The corresponding pack'd lists come out at 2/8 for 'S*' and 4/16/16 for 'L*'.


Examine what is said, not who speaks.

The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

  • Comment on Re: More details on compressing a set of integers

Replies are listed 'Best First'.
Re: Re: More details on compressing a set of integers
by toma (Vicar) on Jan 26, 2003 at 07:08 UTC
    Sorry, I should have mentioned that I didn't mean to use the Base64 module, I was considering writing the numbers in base 64, so that when I run out of 0-9 I go through a-z then A-Z, etc, so I use more of the bits of the characters, and use fewer digits to represent the same number.

    The penalty for compression on small strings is the dictionary that gets set up for the string. If I use a fixed dictionary, this shouldn't be a problem. So I need a customized version of the compression code.

    With compression, I shouldn't need to use base 64, and I'm guessing that I will get about a 60% reduction in size.

    Another optimization I thought of was to change the numbers to represent the differences between the numbers instead of the numbers themselves. So instead of

    1,10,45,50,120
    I would have
    1,9,35,5,70

    It should work perfectly the first time! - toma

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://229950]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others imbibing at the Monastery: (3)
As of 2024-04-19 02:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found