http://qs321.pair.com?node_id=1233613

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a very simple dataset: ints from 1..90 that occur only once in a given line. They cannot be consecutive (meaning there is no sequence in a dataset). Example:

use strict; #never warnings;

#dataset
for (1..100){
    for (my $x=0; $x<90; $x+=10){
        my @c;
        push(@c, int (rand(10)+$x));
        push(@c, int (rand(10)+$x));
        push(@c, int (rand(10)+$x));
        push(@c, int (rand(10)+$x));
        @c = sort{$a<=>$b}@c;
        for (my $i = 1; $i < @c; $i++){
            print chr(33+$c[$i]) if $c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1;
        }
    }
    print "\n";
}

And what I am trying to do is reduce the generated dataset by 50% or more (never got a headache from extra). It needs to be a lossless compression scheme; the order need not be preserved (necessarily), but it would be cool to have it. Is this possible? Any trick is allowed, no matter how complicated or simple it is. If it gets 50%+, and the data can be reconstructed from it, the goal has been achieved.

Thank you !!

PS

no need for code if you do not feel like coding...

UPDATE:

Just to settle the "is the code right" issue: the code is right. I have an automaton that, depending on some input, spits out ints (coded exactly as I demonstrated -> ASCII 33+90). It does it 9 times, in batches of 10's. Sometimes it even skips an output batch, but this is extremely rare. The output it produces is large (3-4 GB) and I was just wondering if it was an option to shrink it. gzip or other compressors are not an option, as the system where it resides is old and has a bare minimum of additional tools on it: basically a naked CPU with C and Perl.

Replies are listed 'Best First'.
Re: Data compression by 50% + : is it possible?
by roboticus (Chancellor) on May 11, 2019 at 23:09 UTC

    baxy77bax:

    That depends on how much redundant information there is in your data. The more redundancy you have, the easier it is to squeeze it.

    So the first thing you ought to do is find out how random your data appears to be. I ran your script a few times, and it seems to generate about 15 characters per line.

    I did a little hacking on your inner loop and discovered that due to your if statements in the encoding of your array @c, there are only 50 different possibilities for it, which you could encode easily in 1 character (six bits). Since you've got 9 iterations of that loop, each line could be encoded as 9 characters. That's not quite enough to crunch out half the space, but it should get you a good start.

    The 50 different possibilities have a fairly non-uniform distribution, where the most common 4 should appear 25% of the time, and the most common 9 appear over 50% of the time, so there's definitely some room to squeeze out even more space.

use strict;
use warnings;

my %cnt;

# Generate all possibilities:
for my $a1 (0 .. 9) {
    for my $a2 (0 .. 9) {
        for my $a3 (0 .. 9) {
            for my $a4 (0 .. 9) {
                my @c = sort ($a1, $a2, $a3, $a4);
                my $s = toStr(@c);
                $cnt{$s}++;
            }
        }
    }
}

my $ttl = 0;
print <<EOHDR;
  num     pct    ttl  ttl pct  offsets to print
-----  -------  -----  -------  ----------------
EOHDR

for my $k (sort { $cnt{$b} <=> $cnt{$a} } keys %cnt) {
    $ttl += $cnt{$k};
    printf "%5u %6.2f %5u %6.2f <%s>\n", $cnt{$k}, 100*$cnt{$k}/10000.0, $ttl, 100*$ttl/10000.0, $k;
}

sub toStr {
    my @c = @_;
    my @rv = ();
    for (my $i=1; $i < @c; $i++) {
        if ($c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1) {
            push @rv, $c[$i];
        }
    }
    return join(":",@rv);
}

    When I ran it, it shows:

$ perl funky.pl
  num     pct    ttl  ttl pct  offsets to print
-----  -------  -----  -------  ----------------
  840   8.40    840    8.40  <7>
  830   8.30   1670   16.70  <8>
  682   6.82   2352   23.52  <6>
  592   5.92   2944   29.44  <>
  524   5.24   3468   34.68  <5>
  508   5.08   3976   39.76  <9>
  408   4.08   4384   43.84  <5:8>
  396   3.96   4780   47.80  <6:9>
  396   3.96   5176   51.76  <6:8>
  366   3.66   5542   55.42  <4>
  336   3.36   5878   58.78  <7:9>
  312   3.12   6190   61.90  <5:7>

    What the table means is that the most common result would be that the array @c would yield chr(33 + 7 + $x), and would do so 8.4% of the time. The fourth row shows that the array @c would print nothing at all 5.92% of the time. The ninth row shows that we would print chr(33 + 6 + $x), chr(33 + 8 + $x) 3.96% of the time.

    By using the table and encoding the values in 9 characters, you could also omit the "\n" and read the data as fixed length records, if you don't want to try to find out a method to squeeze out a bit more. I'd suggest reading Huffman_coding and/or Arithmetic_coding for other ideas.
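
    To make the fixed-record idea concrete, here is a minimal sketch (my own illustration, not roboticus's code): build the table of the 50 possible per-group outputs once, then encode each line as 9 index bytes and decode by reverse lookup. The helper names (encode_line, decode_line) and the index order are arbitrary; the only requirement is that encoder and decoder build the table identically.

use strict;
use warnings;

# Enumerate every legal per-group output: 0-3 distinct, non-consecutive
# offsets drawn from 2..9 (offsets 0 and 1 can never be printed by the
# generator's if-condition). This yields exactly 50 strings, '' included.
my @combos = ('');
for my $a (2 .. 9) {
    push @combos, "$a";
    for my $b ($a + 2 .. 9) {
        push @combos, "$a$b";
        push @combos, "$a$b$_" for $b + 2 .. 9;
    }
}
my %index;
@index{@combos} = 0 .. $#combos;          # combo string -> 0..49

sub encode_line {                         # one text line -> 9 raw bytes
    my ($line) = @_;
    my @group = ('') x 9;
    for my $ch (split //, $line) {
        my $v = ord($ch) - 33;            # 2 .. 89
        $group[ int($v / 10) ] .= $v % 10;
    }
    return join '', map { chr $index{$_} } @group;
}

sub decode_line {                         # 9 raw bytes -> original text line
    my ($record) = @_;
    my ($out, $decade) = ('', 0);
    for my $byte (split //, $record) {
        $out .= chr(33 + $decade + $_) for split //, $combos[ ord $byte ];
        $decade += 10;
    }
    return $out;
}

    Since the records are fixed length, the newline can be dropped and the compressed file read 9 bytes at a time; note that the index bytes include values like 0x0A, so the output has to be treated as binary (binmode).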

    Edit: tweaked a phrase, fixed the links to wikipedia articles.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Hello Roboticus,

      I just realized that we had the same ideas (me just later ;).

      One difference is that you want to encode each of the 9 groups individually with 6 bits each => 6*9 = 54 bits per line, while I'm encoding a whole line as a polynomial

      Sum($g($i) * 50**$i) with $i=0 .. 8

      resulting in the need of 51 bits per line.
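
      For illustration, a sketch of that packing (my own fleshing-out, not LanX's code): with a fixed table of the 50 possible group outputs, a whole line becomes an integer below 50**9 ≈ 1.95e15, i.e. 51 bits, rounded up here to 7 bytes per line for simplicity. It assumes a 64-bit perl; the table order is arbitrary but must match between packing and unpacking.

use strict;
use warnings;

# table of the 50 legal per-group outputs (offsets 2..9, at most 3 of them,
# none consecutive), generated from an 8-bit mask for compactness
my @combos;
for my $mask (0 .. 255) {
    my @off = grep { $mask & (1 << ($_ - 2)) } 2 .. 9;
    next if @off > 3;
    next if grep { $off[$_] == $off[$_ - 1] + 1 } 1 .. $#off;
    push @combos, join '', @off;
}
my %index;
@index{@combos} = 0 .. $#combos;

sub pack_line {                               # text line -> 7 bytes
    my ($line) = @_;
    my @group = ('') x 9;
    for my $ch (split //, $line) {
        my $v = ord($ch) - 33;
        $group[ int($v / 10) ] .= $v % 10;
    }
    my $n = 0;
    $n = $n * 50 + $index{ $group[$_] } for reverse 0 .. 8;   # Sum g(i)*50**i
    return substr pack('Q<', $n), 0, 7;       # low 7 little-endian bytes
}

sub unpack_line {                             # 7 bytes -> text line
    my ($seven) = @_;
    my ($n) = unpack 'Q<', $seven . "\0";
    my ($out, $decade) = ('', 0);
    for (0 .. 8) {
        my $g = $combos[ $n % 50 ];
        $n = int($n / 50);
        $out .= chr(33 + $decade + $_) for split //, $g;
        $decade += 10;
    }
    return $out;
}

      Packed back to back that is 7 bytes against the ~15 bytes plus newline of the text form, so comfortably past the 50% mark even without squeezing out the last 5 bits per line.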

      But I don't understand why you say

      >  That's not quite enough to crunch out half the space

      The old encoding needs on average >14 bytes per line, plus a newline.

      That's > 15 * 8 = 120 bits

      So your 54 bits already need less than half the space!

      What am I missing?

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

        LanX:

        You're not missing anything that I know of. What I was basing my "not quite" phrasing on is the idea of using a single character to encode each group (@c), so it would use 9 characters (72 bits). Had I thought of just packing the required 51-bit records together, it would be more than sufficient to get 50% compression, as the file would take 635 bytes to encode 100 records (sans newlines).

        (The 51 bits came from: 50 different possibilities for each group of 10 in the inner loop (log2(50) == 5.64.. bits/group) * (9 groups) == 50.79 bits.)

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: Data compression by 50% + : is it possible?
by davido (Cardinal) on May 12, 2019 at 02:24 UTC

    If no number can occur more than once per line, then each line can be represented in 90 bits. Bit 0 represents 1, bit 1 represents 2, and bit 89 (the 90th bit) represents 90. A 1 in bit 19 means that there was a 20 in the data set. You said you don't need to preserve order (it would be cool, but not a requirement).

    Reconstructing the data set will provide the values in numeric order.

    Here's a proof of concept:

use strict;
use warnings;
use feature qw(say);
use List::Util qw(shuffle);
use Data::Compare qw(Compare);

sub serialize {
    my $bv;
    vec($bv, $_ - 1, 1) = 1 for @_;
    return $bv;
}

sub deserialize {
    return grep {vec($_[0], $_-1, 1)} 1 .. 90;
}

for (1..10) {
    my @list = (shuffle(1..90))[0..79];
    my $vector = serialize(@list);
    print "length: ", length($vector), ' bytes, ', length($vector) * 8, ' bits.';
    my @deserialized = deserialize($vector);
    say "\tThe data ",
        (Compare(\@deserialized, [sort {$a <=> $b} @list]) ? 'matches.' : 'do not match.');
}

    Each 80-column line cannot take more than 96 bits (we waste 6 just to keep it simple).

    Update: If the OP's data looks like what is generated (but with the additional constraint of not having the same value more than once on a given line), this solution works, but may not achieve an adequate level of compression. Essentially, the longer the line, the better the ratio. The longest any line could be without repeating any ordinal value from 1 to 90 would be 90 characters. In that case, we can achieve almost a 7.5:1 compression ratio using a bit vector and throwing order out the window. But the shorter the input line, the lower the ratio would be. If the line averages twelve characters we have no compression, but still lose order. If the line averages ten characters, we have expansion. So for a 2:1 compression ratio the average line length would need to be 24 characters using my technique.

    Here's a breakdown:

    length | original bit size | stored bit size | compression ratio
    ----------------------------------------------------------------
    1      | 8                 | 96              | 1:12 (expansion)
    4      | 32                | 96              | 1:3  (expansion)
    6      | 48                | 96              | 1:2  (expansion)
    12     | 96                | 96              | 1:1
    14     | 112               | 96              | 1.16:1 (14% compression)
    24     | 192               | 96              | 2:1  (50% compression)
    36     | 288               | 96              | 3:1
    48     | 384               | 96              | 4:1  (75% compression)
    60     | 480               | 96              | 5:1
    72     | 576               | 96              | 6:1
    84     | 672               | 96              | 7:1
    90(max)| 720               | 96              | 7.5:1
    

    So break-even is 12 values. 50% size reduction comes at 24 values. 75% size reduction comes at 48 values per line. This solution will only meet the OP's requirement of a 50% size reduction if the average line length is at least 24 values. As an artifact of using 96 bits to represent 90 values, we have six bits left over to use to encode a newline, and even undef, if we want. Also, if the OP's data set does contain newlines (which we can encode as bit 90), the newline gets a 50% reduction but still fits within the 96 bits, so it serves to improve our compression ratio. If the average line length is 12, plus a newline (which is two characters), we are compressing 14 characters (112 bits) into 96 bits of space, so 1.16:1 compression. Nothing to write home about.


    Dave

Re: Data compression by 50% + : is it possible?
by LanX (Saint) on May 11, 2019 at 22:47 UTC
    I suppose zipping is no option?

    So what's wrong with storing only the differences for a line?

    Consecutive entries are normally less than 10 apart. You only need 4 bits for 0-15.

    I'd probably use 0 as an escape code to enter bigger differences, a bit like UTF-8 does its encoding.

    I had to download and run your code on my mobile, some sample output would have been nice.

    Did it as a proof of concept to run Emacs and Perl inside termux there, which is uber mega cool!!! xD

    Here are my tweaked insights for others (I deleted the ord and the 33+ and added delimiters for each number and group).

2:5:8: 13:17: 22:25:27: 33:39: 45: 55:58: 67: 77: 83:86:
2: 12:15:18: 25:29: 34:39: 44:48: 58: 62:66:68: 72:75: 83:87:
3:6: 17: 26: 33:36: 45:47: 57: 65:69: 73:76:79: 84:88:
8: 15: 23: 36: 43:49: 55: 75: 84:89:

    Update
    So any code d with 0<d<16 is a difference

    0ab is the code for a difference with 0<0xab<256

    You'll also need to denote the line's end.

    Either by a counter at the beginning or with 000 or 0000 for newline.

    Since this code works with 4 bit nibbles you'll get a binary format.

    If you need an ASCII representation use base64 then.
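
    A minimal sketch of that nibble scheme, as I read it (my own code, with some choices hedged): the first value of a line is treated as a difference from 0, a difference of 1..15 takes one nibble, anything larger takes the 0 escape plus two nibbles for the full byte, and three 0 nibbles mark the end of a line before everything is packed two nibbles per byte.

use strict;
use warnings;

sub delta_encode {
    my ($text) = @_;
    my @nib;
    for my $line ($text =~ /(.*)\n/g) {
        my $prev = 0;                          # start of line: diff from 0
        for my $v (map { ord($_) - 33 } split //, $line) {
            my $d = $v - $prev;
            if ($d > 0 && $d < 16) { push @nib, $d }
            else                   { push @nib, 0, $d >> 4, $d & 0xF }
            $prev = $v;
        }
        push @nib, 0, 0, 0;                    # "000" end-of-line marker
    }
    push @nib, 0 if @nib % 2;                  # pad to a whole byte
    return pack 'C*', map { ($nib[2*$_] << 4) | $nib[2*$_ + 1] } 0 .. @nib/2 - 1;
}

sub delta_decode {
    my ($bytes) = @_;
    my @nib = map { ($_ >> 4, $_ & 0xF) } unpack 'C*', $bytes;
    my ($out, $prev) = ('', 0);
    while (@nib >= 3) {                        # anything shorter is padding
        my $n = shift @nib;
        my $d;
        if ($n) {
            $d = $n;
        }
        else {
            my ($hi, $lo) = splice @nib, 0, 2;
            if ($hi == 0) {                    # 0,0,0 -> end of line
                $out .= "\n";
                $prev = 0;
                next;
            }
            $d = ($hi << 4) | $lo;             # escaped, byte-sized difference
        }
        $prev += $d;
        $out .= chr(33 + $prev);
    }
    return $out;
}

    For a typical ~15-character line that is roughly 15-20 nibbles plus the 3-nibble marker, i.e. around 9-11 bytes against 16 for the text line with its newline -- close to, but not guaranteed past, the 50% mark.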

    If you need better compression look up Huffman coding on WP.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      Supposing your input is correct and that it's truly random, then it should be possible to represent each line with ~ 7.356 bytes or ~ 59 bits.

      You have 9 groups with 0-3 numbers in the range 2..9.

      I.e each group can be represented with a byte with at most 3 bits set.

      There are only 93=56+28+8+1 such combinations possible.

      ln(93**9)/ln(256) = 7.35655366 bytes per line

      At the moment you'll need ~2.5 characters per group ((56*3 + 28*2 + 8*1)/93 ≈ 2.5), which results in ~22.5 characters per line.

      That's about one third. So even with a non-binary representation you should achieve your 50 percent or better.

      This can only be improved if the combinations don't have the same likelihood.

      I don't wanna dig deeper because I don't trust your code and smell an xy problem here.

      Update

      I just realised that you are forbidding consecutive numbers in your if condition. I.e (2,3,9) is never possible.

      This will change the math, but the approach is the same.

      Roboticus said you need ~15 chars per line on average; 7.4 bytes per line is just an upper bound, so 50% is easily reached.

      Don't wanna calculate it again! This would need to be done programmatically.

      (But I don't trust your code anyway ;)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

        Instead of calculating I wrote a little script counting the probabilities of group combinations.

        Roboticus was right, you need about 14.05 characters per line plus "\n"

        Only 1 + 8 + 21 + 20 = 50 combinations are possible per group, resulting in a conservative compression of 50.79 bits = 6.34 bytes per line, which already means a compression to 42% = 58% win.

        But if you look into the likelihood of those combinations you clearly see that a Huffman encoding would result in an even better ratio.

        I think that's then near the theoretical optimum. (further reading Huffman_coding_with_unequal_letter_costs )

        All this supposing your input was real... ;-)

use strict;
use warnings;
use Data::Dump qw/pp/;

my %count;

for my $c0 (0..9){
    for my $c1 (0..9){
        for my $c2 (0..9){
            for my $c3 (0..9) {
                my @c = sort {$a <=> $b} ($c0,$c1,$c2,$c3);
                #print "@c\t:\t";
                my @allowed;
                for my $i (1..3) {
                    if ( $c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1 ) {
                        push @allowed, $c[$i];
                    }
                }
                #print "@allowed\n";
                $count{join "",@allowed}++;
            }
        }
    }
}

my @length;
my $average;

for my $k (keys %count){
    my $len = length $k;
    $length[$len]++;
    $average += $len * $count{$k}/10000;
}

warn '@length: ', pp \@length;
warn 'average #characters: group/line', pp [$average, $average*9];

my $combies = 0;
$combies += $length[$_] for 0..3;
#$combies=93;

warn "# possible combinations: ", $combies;

my $upper_bound = log($combies)/log(2)*9;
warn 'Upper bound bits, bytes', pp [$upper_bound, $upper_bound/8];

#warn "ranking", pp [ sort {$b <=> $a} values %count ];

warn 'probabilities: ', pp \%count;

@length: [1, 8, 21, 20] at /tmp/compress.pl line 36.
average #characters: group/line[1.5624, 14.0616] at /tmp/compress.pl line 37.
# possible combinations: 50 at /tmp/compress.pl line 44.
Upper bound bits, bytes[50.7947057079725, 6.34933821349656] at /tmp/compress.pl line 48.
probabilities: {
  ""    => 592,
  "2"   => 74,
  "24"  => 60,
  "246" => 24,
  "247" => 24,
  "248" => 24,
  "249" => 24,
  "25"  => 84,
  "257" => 24,
  "258" => 24,
  "259" => 24,
  "26"  => 84,
  "268" => 24,
  "269" => 24,
  "27"  => 84,
  "279" => 24,
  "28"  => 84,
  "29"  => 60,
  "3"   => 208,
  "35"  => 144,
  "357" => 48,
  "358" => 48,
  "359" => 48,
  "36"  => 192,
  "368" => 48,
  "369" => 48,
  "37"  => 192,
  "379" => 48,
  "38"  => 192,
  "39"  => 144,
  "4"   => 366,
  "46"  => 228,
  "468" => 72,
  "469" => 72,
  "47"  => 300,
  "479" => 72,
  "48"  => 300,
  "49"  => 228,
  "5"   => 524,
  "57"  => 312,
  "579" => 96,
  "58"  => 408,
  "59"  => 312,
  "6"   => 682,
  "68"  => 396,
  "69"  => 396,
  "7"   => 840,
  "79"  => 336,
  "8"   => 830,
  "9"   => 508,
} at /tmp/compress.pl line 52.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

        update

        I just realized that Roboticus already had the same basic ideas here: Re: Data compression by 50% + : is it possible?

Re: Data compression by 50% + : is it possible?
by tybalt89 (Monsignor) on May 14, 2019 at 00:30 UTC

    Fun with bit strings :)

    Here's compress and decompress subs with supporting hashes.
    Using your test generator, I get about 55% compression most of the time. This is better than gzip or bzip2 on the one test case I tried. gzip was below 30% and bzip2 was below 45%.

    There are pathological cases however. If you replace your rand(10) with 9 you will get a 675% expansion instead of compression.

    The compressor maps 4 lines of your test input to 27 bytes and the decompressor does the opposite, so I suggest you compress in multiples of four lines, and decompress in multiples of 27 bytes if you have to split up very large files.
    decompress(compress($string)) does reproduce the input string exactly, however, no matter how many lines are in it (at least in my testing :)

#!/usr/bin/perl
# https://perlmonks.org/?node_id=1233613
use strict;
use warnings;

my @legal = grep !/11/ && tr/1// <= 3, glob '{0,1}' x 8;
my %code;
@code{@legal} = map { unpack 'b6', chr } 0 .. $#legal;
my %decode = reverse %code;
$_ = [ split ' ', '23456789' & tr/01/ ?/r ] for values %decode;

sub compress {
    my $coded = '';
    for ( shift =~ /(.*)\n/g ) {
        my @lookup = (0) x 123;
        @lookup[ unpack 'C*', $_ ] = (1) x length;
        for( my $group = 35; $group < 123; $group += 10 ) {
            $coded .= $code{ join '', @lookup[$group .. $group + 7] };
        }
    }
    return pack 'b*', $coded;
}

sub decompress {
    my $decoded = '';
    for my $line ( unpack('b*', shift) =~ /.{54}/g ) {
        my $decade = 33;
        for my $group ( @decode{ unpack '(a6)*', $line } ) {
            $decoded .= pack 'C*', map $decade + $_, @$group;
            $decade += 10;
        }
        $decoded .= "\n";
    }
    return $decoded;
}

my $input = '';
for (1 .. 100) {
    for (my $x=0; $x<90; $x+=10) {
        my @c;
        push(@c, int (rand(10)+$x));
        push(@c, int (rand(10)+$x));
        push(@c, int (rand(10)+$x));
        push(@c, int (rand(10)+$x));
        @c = sort{$a<=>$b}@c;
        for (my $i = 1; $i < @c; $i++) {
            $input .= chr(33+$c[$i]) if $c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1;
        }
    }
    $input .= "\n";
}

#use Data::Dump 'dd'; dd $_ for $input =~ /.*\n/g;

print "\n input length ", length $input, "\n";
my $compressed = compress($input);
my $compressedlength = length $compressed;
print "compressed length $compressedlength\n";
my $restored = decompress($compressed);

if( $input eq $restored ) {
    printf "\nMatched, compression ratio = %.1f%%\n",
        100 * (1 - length($compressed) / length($restored));
}
else {
    print "----------------------------------------failed\n";
    use Data::Dump 'dd';
    dd $_ for $restored =~ /.*\n/g;
}

    Output of typical run (100 random lines) :

 input length 1507
compressed length 675

Matched, compression ratio = 55.2%

    Thanks for the fun chance to play with long bit strings.

Re: Data compression by 50% + : is it possible?
by BrowserUk (Patriarch) on May 11, 2019 at 23:03 UTC

    Impossible! If your data is truly random.

    You need a gnat's under 6.5 bits to represent each of your values and are currently using 8 bits. If you could pack it exactly (meaning using half-bits), then the best you could achieve is 1 - 6.5/8 = 18.75%.

    If you try dictionary lookup:

    1. The dictionary for 2 byte pairings requires 13 bits; no saving over 16.
    2. 3 byte pairings require 19.5, no saving over 24.
    3. Best you could do is use a 40-bit number to represent each of the 531,441,000,000 6-byte pairings; giving a 1 - 40/48 = 16.67% saving.

    Beyond that, you're into the luck of the draw. Some random datasets might contain enough common (sub)sequences to allow Lempel–Ziv or similar to get close to 50%, but for other datasets, the same algorithm will produce barely any reduction (or even an increase).

    See Kolmogorov_complexity for more.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

      BrowserUK: but it's not truly random, it sets some rules: can not have consecutive numbers in a single line. Can one work this out analytically?

        It makes a small -- insignificant -- difference to the outcome.

        A quick check -- because quicker than trying to derive a formula -- shows that of the 729,000 3-value combinations, 7,922 contain consecutive numbers:

use Algorithm::Combinatorics qw[ variations_with_repetition ];;

@c = variations_with_repetition( [ 1 .. 90 ], 3 );;
print scalar @c;;
729000

$d = 0; $_->[0]+1 == $_->[1] or $_->[1]+1 == $_->[2] and ++$d for @c; print $d;;
7922

        And there are 8010 combinations of 2 sets of three that must be excluded because the last digit of the first set of 3 is one less than the first digit of the second set:

$d = 0; $_->[0] == $_->[2]+1 and ++$d for @c; print $d;;
8010

        Which means that instead of 531,441,000,000 6-value combinations, there are only (729,000 - 7922 )**2 - 8010 = 519,953,474,074, which means it would still take 40-bits to represent any legal 6-value string; so the math doesn't change: ( 1 - 40/48 ) *100 = 16.67% is best possible for *any* dataset.

        Any algorithm that achieves better on any given dataset will not achieve the same results for all datasets, nor even a high percentage of them.

        I.e. to achieve better, you'd need to reduce the size of the domain so that you could compress 36 bits into 48, which would give 25% compression. But that would require throwing away 87% of the possible datasets.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
        In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
        but you are wrong.

        Prove it!

        Take time out of your always 'too busy' schedule, and post some code. Working code. Code that I can throw a few sample datasets at and see if roboticus has really defeated a 30+ y/o CompSci theoretical fact.

        If you can post code that can compress *all* OP-compliant datasets by more than 16.67%, I'll buy him dinner.

        Better still -- and you'll like this idea -- I'll stop coming around here.

        So, PROVE IT!; or forever pursue your favorite pastime and "hold your piece".


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
        In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re: Data compression by 50% + : is it possible?
by bliako (Monsignor) on May 11, 2019 at 23:53 UTC

    It's too late for me and I may be wrong, but storing the first character and then the deltas (i.e. consecutive character differences) (got the idea from LanX's Re: Data compression by 50% + : is it possible?) achieved ~50% compression using gzip:

# a.pl
use strict; #never warnings;

my @inp = ();

#dataset
for (1..100){
    my @ani = ();
    for (my $x=0; $x<90; $x+=10){
        my @c;
        push(@c, int (rand(10)+$x));
        push(@c, int (rand(10)+$x));
        push(@c, int (rand(10)+$x));
        push(@c, int (rand(10)+$x));
        @c = sort{$a<=>$b}@c;
        for (my $i = 1; $i < @c; $i++){
            push(@ani, chr(33+$c[$i])) if $c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1;
        }
    }
    push @inp, \@ani;
}

foreach my $abc (@inp){
    print $abc->[0];
    for(my $x=1; $x<@$abc; $x++){
        print ord($abc->[$x]) - ord($abc->[$x-1]);
    }
    print "\n";
}

perl a.pl > xyz
ls -al xyz
gzip xyz
ls -al xyz.gz
Re: Data compression by 50% + : is it possible?
by ikegami (Patriarch) on May 12, 2019 at 21:34 UTC

    Since they don't repeat, each subsequent number requires one less bit to store.

    If we store each number using the minimal number of bits, we can pack the information into 63 bytes, which is 70% of using a byte per number. (30% savings)

$ perl -MPOSIX=ceil -e'
    my $acc = 0;
    $acc += ceil(log($_)/log(2)) for 2..90;
    CORE::say ceil($acc / 8);
'
63

    There's still a lot of loss in there. Each number introduces a little bit of loss. By using a variable-base number, we only incur that loss once for the whole line.

    If we use a variable-base number, we can pack the information into 58 bytes, which is 64% of using a byte per number. (36% savings)

$ perl -MPOSIX=ceil -MMath::BigInt -MMath::BigRat -e'
    my $acc = Math::BigRat->new(1);
    $acc *= $_ for 2..90;
    CORE::say Math::BigRat->new($acc)->blog(256)->bceil;
'
58
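
    As an illustration of the variable-base idea (my own sketch, not ikegami's code): replace each value by its rank among the values not yet used, so the digits have radices 90, 89, 88, ..., and fold them into one big integer. The helper names are made up, the input is assumed to be distinct values from 1..90, and the line length has to be stored alongside the number (one byte is enough).

use strict;
use warnings;
use Math::BigInt;

sub to_varbase {                              # list of distinct 1..90 values -> BigInt
    my (@values) = @_;
    my @pool = (1 .. 90);                     # values still available
    my $n    = Math::BigInt->new(0);
    for my $v (@values) {
        my ($rank) = grep { $pool[$_] == $v } 0 .. $#pool;
        $n = $n * scalar(@pool) + $rank;      # digit in radix |pool|
        splice @pool, $rank, 1;               # a value may not repeat
    }
    return $n;                                # at most 58 bytes for a full 90-value line
}

sub from_varbase {                            # BigInt + value count -> original list
    my ($n, $count) = @_;
    $n = Math::BigInt->new("$n");
    my @ranks;
    for my $radix (90 - $count + 1 .. 90) {   # smallest radix was folded in last
        my ($q, $r) = $n->copy->bdiv($radix);
        unshift @ranks, $r->numify;
        $n = $q;
    }
    my @pool = (1 .. 90);
    return map { splice @pool, $_, 1 } @ranks;
}

    A serialized record then needs the integer (58 bytes at most, correspondingly smaller for shorter lines) plus one length byte per line.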

      ...and that's just from efficient storing. No "compression" has been done yet. Compression techniques can be used in addition to this.

Re: Data compression by 50% + : is it possible?
by salva (Canon) on May 13, 2019 at 07:21 UTC
    It needs to be a lossless compression scheme and the order needs not to be preserved

    Without preserving the order, you can reduce it to 720 bytes (or even less). You just have to count the number of times every number appears and store that (using 64-bit integers: 8 * 90 = 720 bytes)
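
    A tiny sketch of that counting approach (my own illustration): tally every character of the file into 90 slots and pack the counters as fixed-width integers. Lines and ordering are gone, which is why it only fits the loosest reading of "order needs not to be preserved".

use strict;
use warnings;

my @count = (0) x 90;                      # one slot per code point chr(33) .. chr(122)
while (my $line = <STDIN>) {
    chomp $line;
    $count[ ord($_) - 33 ]++ for split //, $line;
}
my $packed = pack 'Q*', @count;            # 90 x 8 bytes = 720 bytes (needs a 64-bit perl)

# restoring gives back the multiset of values, not the lines:
# my @restored = unpack 'Q*', $packed;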

      I think he meant the order in a line, because every code point is unique there. :).

      Just storing the frequency table is a good point. xD

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: Data compression by 50% + : is it possible?
by harangzsolt33 (Chaplain) on May 12, 2019 at 05:00 UTC
    The 50% compression rate is possible. More is possible. Depends on the arrangement of input data and the algorithm you're using!

    Here's an idea: Take the first bit of each number and create a list of numbers from that. See if you can compress that list at a better rate. Take the second bit of each input number and create another list, and so on... If your numbers are all even-odd-even-odd or all odd or all even numbers, then this method will help.

    If the input numbers are totally random, I would still not give up just yet! I'd generate a list of "random" numbers and XOR the input numbers with randoms to get a new list that has a better chance of being compressed. Try that!

    Most programs generate random numbers this way:

    FOR LOOP:
       S = (S * A + B) % C
       print "Random number: ", S
    END FOR

    S is the initial seed for the random number generator. Programs usually set this to the number of milliseconds since 1970. A, B, and C are constants that can be any value. In many programming languages the builtin random() function usually returns a number between 0 and 1, which you get by dividing S by C. If you repeat this calculation over and over again, you get a list of numbers that seems quite random.

    By modifying the values of S, A, B, or C even slightly, you get a totally different series of numbers! If, let's say, A is 13.4849927107, and you just change one digit, you will get a totally different list of numbers that does not resemble the previous set at all. So, you could initialize these constants and then get a random list. Take two random lists and either ADD the values or XOR them or whatever. The resulting list MIGHT HAVE more order than your input data set! And this can help you compress the list further.

    I've done this with ZIP files... You know, when you compress a ZIP file and you compress it again and again, you reach a limit after which the size starts growing instead of shrinking! But if, at some point, you encode the ZIP file using a list of random numbers, you can sometimes ZIP it again further and get an even smaller file! ;-)

Re: Data compression by 50% + : is it possible?
by LanX (Saint) on May 13, 2019 at 00:27 UTC
    > Just to settle "is the code right" issue. The code is right.

    The question is rather if the distribution of values is like the one simulated by your random numbers.

    The other one is whether you need the "readable" character range or whether a binary file is OK.

    Please look at the probability output from the script I posted here and tell us if it's accurate, or even better calculate the frequencies of groups from your real data.

    I expect a lossless 50% reduction to be easy, because of the unused gaps in your data.

    A better compression will need Huffman coding, but for this to work you need the frequency table anyway.

    FWIW there are two Huffman modules on CPAN and one script here in the archives.
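
    For anyone who wants to avoid a module, here is a bare-bones Huffman-code builder as a sketch (my own code, not one of the CPAN modules or the archived script): it turns a symbol => count table -- such as the group frequencies from the script above -- into a symbol => bitstring table; the concatenated bitstrings can then be packed with pack('b*', ...).

use strict;
use warnings;

sub huffman_codes {
    my (%freq) = @_;
    # each node: [ weight, symbol-or-undef, left, right ]
    my @nodes = map { [ $freq{$_}, $_ ] } keys %freq;
    while (@nodes > 1) {
        @nodes = sort { $a->[0] <=> $b->[0] } @nodes;
        my ($x, $y) = splice @nodes, 0, 2;            # merge the two lightest nodes
        push @nodes, [ $x->[0] + $y->[0], undef, $x, $y ];
    }
    my %code;
    my @stack = ([ $nodes[0], '' ]);
    while (my $item = pop @stack) {
        my ($node, $prefix) = @$item;
        if (defined $node->[1]) {                     # leaf: record its code
            $code{ $node->[1] } = $prefix eq '' ? '0' : $prefix;
        }
        else {                                        # internal node: descend
            push @stack, [ $node->[2], $prefix . '0' ],
                         [ $node->[3], $prefix . '1' ];
        }
    }
    return %code;
}

# e.g. the four most frequent group outputs from the table above:
my %code = huffman_codes('7' => 840, '8' => 830, '6' => 682, '' => 592);
print "<$_> => $code{$_}\n" for sort keys %code;

    Decoding needs the same table, so either ship the code table with the file or rebuild it from the fixed group frequencies; with the skewed distribution shown above, the frequent one-character groups should get codes of only a few bits instead of a flat 6.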

    update

    And please post proper replies, I only found your update by accident.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

      sorry for the off-the-thread updates :)

      The distribution of values is exactly like the one simulated. So every time the automaton spits out a number, it is always larger than the previous one, never consecutive (larger by exactly 1), and never larger than 90.

      The readable output is what it spits out, but binary is OK. I just need to store it.

      The frequency table looks OK, but I'll check it one more time and post if I find irregularities. However, I understand what you and roboticus did. Thank you for the input!! :)

      PS: yes, the order I was talking about was within rows

        Great! :)

        FWIW Compress::Huffman looks promising.

        It seems to take a frequency table as input (actually probabilities, so divide my table by 10000), store it together with a bit string, and decode it again.

        On a side note:

        You might want to experiment with changed weights, that is, multiply each frequency by the length of its key (like 246 by 3) and recalculate the proportions.

        This could give you a better compression compared to the old format because there you need 3 characters for this entry.

        (I'm not sure here, it's mind boggling)

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: Data compression by 50% + : is it possible?
by harangzsolt33 (Chaplain) on May 14, 2019 at 17:17 UTC
    If we calculate the number of possible ways one could arrange 90 numbers, that is a very large number. And if my calculations were correct, it would require 463 bits to store that number! So, using this method, it's not possible to compress by 50%, because 90 bytes is 720 bits. OR if you store each int in 7 bits, then that's 630 bits total. So, you have a list of 630 bits. And mathematically, it is possible to store this list of ints in 463 bits BECAUSE you said that each int from 1 to 90 only occurs once!

    But as I said, the compression ratio depends on the algorithm you're using! Using the mathematical approach I explained above, it is possible to store 630 bits in 463 bits ALWAYS regardless of how the numbers are shuffled! That means the output will be EXACTLY 463 bits regardless of whether we started with a list that is perfectly sorted or totally random.

      Many people in this thread miss crucial information already given in the OP's text!

      (apart from reading the explicit example code given)

      > - order needs not to be preserved

      > - occur only once in a given line.

      > - They cannot be consecutive (meaning there is no sequence in a dataset).

      I.e. tuples like (3,1, ...), (1,1,...) or (1,2, ...) are impossible. (see the OP's if condition)

      But the OP's format is obviously highly redundant, he's not only

      • allowing such tuples
      • but also unsorted input
      • and wasting a full byte to encode a 0..9 increment in 9 intervals

      The last point alone leaves sufficient room for compression of near 50% or more.

      Roboticus and I already elaborated this explicitly by demonstrating all possible independent tuples and pointing to their near optimal compression using Huffman coding.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: Data compression by 50% + : is it possible?
by QM (Parson) on May 15, 2019 at 13:24 UTC
    Given the data restrictions, has anyone considered using vec to compose a set?

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      I would have if vec handled 6-bit fields, but sadly it doesn't.

        Hmmm. I was just thinking of a 90-bit vec as a set. So the BITS argument in vec(EXPR, OFFSET, BITS) is 1.

        I've probably missed some criterion why the vec can't be interpreted as 8-bit chars for output.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of

Re: Data compression by 50% + : is it possible?
by betmatt (Scribe) on May 14, 2019 at 02:52 UTC
    Hi,

    Without reading through all the replies... to be honest, you have had quite a few.

    My advice would be that there is always the potential to compress. Even in random sequences you will get repeated patterns. The difficulty is finding those patterns. You want a pattern that is easy to find first.

    My advice would be to sort the text of interest and then count for each character type. That might be best done in a database. You might want to split the text up as well and do that bit by bit. With this information you will be able to better find opportunities for compression.


    Update__________________


    With infinite computing power LanX would be right.

    For very large file sizes it would be difficult if not impossible to find the best compression solution. In that situation I would be right.

    I know that the question states 50%. But really if you think about it you could compress the stored algorithms that do the transformation. It just goes on and on. Do people understand what I am saying?

    Sorting is maybe a good place to start, because the algorithms (code) can be modified from that point in order to preserve information that will allow the recreation of the original file. There are a range of different sorting algorithms that I have in a book here. If anyone wants me to post any of those, I will.
      > My advice would be that there is always the potential to compress. Even in random sequences you will get repeated patterns. The difficulty is finding those patterns. You want a pattern that is easy to find first.

      That's drivel; you can only compress independent information without loss as far as its information density allows.

      A bit will never encode 3 states.

      update

      > Do people understand what I am saying?

      no

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re: Data compression by 50% + : is it possible?
by holli (Abbot) on May 14, 2019 at 10:41 UTC


    holli

    You can lead your users to water, but alas, you cannot drown them.

      Is this an example of a highly compressed answer? I’m not sure your algorithm is lossless. :P

        Never heard of quantum-comp-posting???

        If you reload the node often enough it'll show the answer to everything. ;-P

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery FootballPerl is like chess, only without the dice