Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

If no number can be repeated more than once per line then each line can be represented in 90 bits. Bit 0 represents 1, bit 1 represents 2, and bit 89 (the 90th bit) represents 90. A 1 in bit 19 means that there was a 20 in the data set. You said you don't need to preserve order (it would be cool, right but not a requirement).

Reconstructing the data set will provide the values in numeric order.

Here's a proof of concept:

use strict; use warnings; use feature qw(say); use List::Util qw(shuffle); use Data::Compare qw(Compare); sub serialize { my $bv; vec($bv,$_ - 1,1) = 1 for @_; return $bv; } sub deserialize { return grep {vec($_[0], $_-1, 1)} 1 .. 90; } for (1..10) { my @list = (shuffle(1..90))[0..79]; my $vector = serialize(@list); print "length: ", length($vector), ' bytes, ', length($vector) * 8 +, ' bits.'; my @deserialized = deserialize($vector); say "\tThe data ", (Compare(\@deserialized, [sort {$a <=> $b} @lis +t]) ? 'matches.' : 'do not match.'); }

Each 80 column line could not take more than 96 bits (we waste 6 just to keep it simple).

Update: If the OP's data looks like what is generated (but with the additional constraint of not having the same value more than one time on a given line) this solution works, but may not achieve an adequate level of compression. Essentially, the longer the line, the better the ratio. The longest any line could be without repeating any ordinal value from 1 to 90 would be 90 characters. In that case, we can achieve almost an 7.5:1 compression ratio using a bit vector and throwing order out the window. But the shorter the input line, the lower the ratio would be. If the line averages twelve characters we have no compression, but still lose order. If the line averages ten characters, we have expansion. So for a 2:1 compression ratio the average line length would need to be 24 characters using my technique.

Here's a breakdown:

length | original bit size | stored bit size | compression ratio
----------------------------------------------------------------
1      | 8                 | 96              | 1:12 (expansion)
4      | 32                | 96              | 1:3  (expansion)
6      | 48                | 96              | 1:2  (expansion)
12     | 96                | 96              | 1:1
14     | 112               | 96              | 1.16:1 (14% compression)
24     | 192               | 96              | 2:1  (50% compression)
36     | 288               | 96              | 3:1
48     | 384               | 96              | 4:1  (75% compression)
60     | 480               | 96              | 5:1
72     | 576               | 96              | 6:1
84     | 672               | 96              | 7:1
90(max)| 720               | 96              | 7.5:1

So break-even is 12 values. 50% size reduction comes at 24 values. 75% size reduction comes at 48 values per line. This solution will only meet the OP's requirement of a 50% size reduction if the average line length is values. As an artifact of using 96 bits to represent 90 values, we have six bits left over to use to encode newline, and even undef, if we want. Also, if the OP's data set does contain newlines (which we can encode as bit 90, the newline gets a 50% reduction, but still fits within the 96 bits, so serves to improve our compression ratio. If the average line length is 12, plus a newline (which is two characters), we are compressing 14 characters (112 bits) into 96 bits of space, so 1.16:1 compression. Nothing to write home about.


Dave


In reply to Re: Data compression by 50% + : is it possible? by davido
in thread Data compression by 50% + : is it possible? by baxy77bax

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2024-04-25 17:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found