Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re: Data compression by 50% + : is it possible?

by davido (Cardinal)
on May 12, 2019 at 02:24 UTC ( [id://1233628]=note: print w/replies, xml ) Need Help??


in reply to Data compression by 50% + : is it possible?

If no number can be repeated more than once per line then each line can be represented in 90 bits. Bit 0 represents 1, bit 1 represents 2, and bit 89 (the 90th bit) represents 90. A 1 in bit 19 means that there was a 20 in the data set. You said you don't need to preserve order (it would be cool, right but not a requirement).

Reconstructing the data set will provide the values in numeric order.

Here's a proof of concept:

use strict; use warnings; use feature qw(say); use List::Util qw(shuffle); use Data::Compare qw(Compare); sub serialize { my $bv; vec($bv,$_ - 1,1) = 1 for @_; return $bv; } sub deserialize { return grep {vec($_[0], $_-1, 1)} 1 .. 90; } for (1..10) { my @list = (shuffle(1..90))[0..79]; my $vector = serialize(@list); print "length: ", length($vector), ' bytes, ', length($vector) * 8 +, ' bits.'; my @deserialized = deserialize($vector); say "\tThe data ", (Compare(\@deserialized, [sort {$a <=> $b} @lis +t]) ? 'matches.' : 'do not match.'); }

Each 80 column line could not take more than 96 bits (we waste 6 just to keep it simple).

Update: If the OP's data looks like what is generated (but with the additional constraint of not having the same value more than one time on a given line) this solution works, but may not achieve an adequate level of compression. Essentially, the longer the line, the better the ratio. The longest any line could be without repeating any ordinal value from 1 to 90 would be 90 characters. In that case, we can achieve almost an 7.5:1 compression ratio using a bit vector and throwing order out the window. But the shorter the input line, the lower the ratio would be. If the line averages twelve characters we have no compression, but still lose order. If the line averages ten characters, we have expansion. So for a 2:1 compression ratio the average line length would need to be 24 characters using my technique.

Here's a breakdown:

length | original bit size | stored bit size | compression ratio
----------------------------------------------------------------
1      | 8                 | 96              | 1:12 (expansion)
4      | 32                | 96              | 1:3  (expansion)
6      | 48                | 96              | 1:2  (expansion)
12     | 96                | 96              | 1:1
14     | 112               | 96              | 1.16:1 (14% compression)
24     | 192               | 96              | 2:1  (50% compression)
36     | 288               | 96              | 3:1
48     | 384               | 96              | 4:1  (75% compression)
60     | 480               | 96              | 5:1
72     | 576               | 96              | 6:1
84     | 672               | 96              | 7:1
90(max)| 720               | 96              | 7.5:1

So break-even is 12 values. 50% size reduction comes at 24 values. 75% size reduction comes at 48 values per line. This solution will only meet the OP's requirement of a 50% size reduction if the average line length is values. As an artifact of using 96 bits to represent 90 values, we have six bits left over to use to encode newline, and even undef, if we want. Also, if the OP's data set does contain newlines (which we can encode as bit 90, the newline gets a 50% reduction, but still fits within the 96 bits, so serves to improve our compression ratio. If the average line length is 12, plus a newline (which is two characters), we are compressing 14 characters (112 bits) into 96 bits of space, so 1.16:1 compression. Nothing to write home about.


Dave

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1233628]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (5)
As of 2024-04-25 08:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found