PerlMonks  

Re: Data compression by 50% + : is it possible?

by LanX (Saint)
on May 11, 2019 at 22:47 UTC


in reply to Data compression by 50% + : is it possible?

I suppose zipping is no option?

So what's wrong with storing only the differences for a line?

Consecutive entries are normally less than 10 apart. You only need 4 bits for 0-15.

I'd probably use 0 as an escape code to introduce bigger differences, a bit like UTF-8 does with its variable-length encoding.

I had to download and run your code on my mobile, some sample output would have been nice.

Did it as a proof of concept to run Emacs and Perl inside termux there, which is uber mega cool!!! xD

Here is my tweaked output for others (I deleted the ord and the 33+ and added delimiters for each number and group):

2:5:8: 13:17: 22:25:27: 33:39: 45: 55:58: 67: 77: 83:86:
2: 12:15:18: 25:29: 34:39: 44:48: 58: 62:66:68: 72:75: 83:87:
3:6: 17: 26: 33:36: 45:47: 57: 65:69: 73:76:79: 84:88:
8: 15: 23: 36: 43:49: 55: 75: 84:89:

Update
So any code d with 0<d<16 is a difference

0ab is the code for a difference with 0<0xab<256

You'll also need to denote the line's end.

Either by a counter at the beginning or with 000 or 0000 for newline.

Since this code works with 4 bit nibbles you'll get a binary format.
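A minimal sketch of the nibble scheme described above, not a tested implementation of the original poster's format. It assumes that the numbers within a line are strictly ascending, that every difference fits in 1..255, and that `000` is the end-of-line marker. The names `encode_line` and `decode_lines` are made up for this illustration.

```perl
use strict;
use warnings;

# Nibble codes, following the scheme described above:
#   d      a difference with 0 < d < 16
#   0 a b  a difference 0xab (used here for 16 .. 255)
#   0 0 0  end of line
sub encode_line {
    my @numbers = @_;
    my ($prev, @nibbles) = (0);
    for my $n (@numbers) {
        my $d = $n - $prev;
        die "difference $d not in 1..255\n" if $d < 1 || $d > 255;
        push @nibbles, $d < 16 ? $d : (0, $d >> 4, $d & 15);
        $prev = $n;
    }
    return (@nibbles, 0, 0, 0);
}

sub decode_lines {
    my @nibbles = @_;
    my (@lines, @current);
    my $prev = 0;
    while (@nibbles) {
        my $d = shift @nibbles;
        if ($d == 0) {
            last if @nibbles < 2;              # trailing padding nibble
            my ($hi, $lo) = splice @nibbles, 0, 2;
            $d = $hi * 16 + $lo;
            if ($d == 0) {                     # 000 closes the line
                push @lines, [@current];
                @current = ();
                $prev    = 0;
                next;
            }
        }
        $prev += $d;
        push @current, $prev;
    }
    return @lines;
}

# First sample line from above, with the delimiters removed.
my @line    = (2, 5, 8, 13, 17, 22, 25, 27, 33, 39, 45, 55, 58, 67, 77, 83, 86);
my @nibbles = encode_line(@line);

# Two nibbles per byte; pack 'H*' pads an odd count with a zero nibble.
my $packed  = pack 'H*', join '', map { sprintf '%x', $_ } @nibbles;
my @decoded = decode_lines(map { hex($_) } split //, unpack 'H*', $packed);

printf "%d numbers -> %d nibbles -> %d bytes\n",
    scalar @line, scalar @nibbles, length $packed;
print "round trip ok\n" if "@{ $decoded[0] }" eq "@line";
```

For this sample line every difference is below 16, so the 17 numbers take 17 nibbles plus the 3-nibble line terminator, which packs into 10 bytes.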

If you need an ASCII representation use base64 then.

If you need better compression look up Huffman coding on WP.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery FootballPerl is like chess, only without the dice

Re^2: Data compression by 50% + : is it possible?
by LanX (Saint) on May 12, 2019 at 00:34 UTC
    Supposing your input is correct and that it's truly random, then it should be possible to represent each line with ~7.356 bytes or ~59 bits.

    You have 9 groups with 0-3 numbers in the range 2..9.

    I.e. each group can be represented with a byte with at most 3 bits set.

    There are only 93=56+28+8+1 such combinations possible.

    ln(93**9)/ln(256) = 7.35655366 bytes per line

    At the moment you need ~2.5 characters per group, which results in ~22.5 chars per line: (56*3+28*2+8*1)/93 ≈ 2.49.

    7.36 bytes is about one third of that. So even with a non-binary representation you should achieve your 50 percent or better.
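The arithmetic above can be checked with a short script. This is only a sketch of the counting argument, assuming 9 independent groups per line and 0 to 3 distinct digits from 2..9 per group, with the exponent read as 93 to the 9th power (one factor of 93 per group). The helper `choose` is introduced here for illustration.

```perl
use strict;
use warnings;

# Binomial coefficient C(n, k), computed iteratively.
sub choose {
    my ($n, $k) = @_;
    my $c = 1;
    $c = $c * ($n - $_ + 1) / $_ for 1 .. $k;
    return $c;
}

my @digits = (2 .. 9);                 # 8 possible values per group
my ($combos, $chars) = (0, 0);
for my $k (0 .. 3) {                   # a group holds 0 to 3 numbers
    my $n = choose(scalar @digits, $k);
    $combos += $n;                     # 1 + 8 + 28 + 56
    $chars  += $n * $k;                # characters if all combos are equally likely
}

my $groups         = 9;
my $bytes_per_line = $groups * log($combos) / log(256);
my $chars_per_line = $groups * $chars / $combos;

printf "combinations per group: %d\n",   $combos;
printf "bytes per line:         %.3f\n", $bytes_per_line;
printf "chars per line:         %.2f\n", $chars_per_line;
printf "ratio:                  %.2f\n", $bytes_per_line / $chars_per_line;
```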

    This can only be improved if the combinations don't have the same likelihood.

    I don't wanna dig deeper because I don't trust your code and smell an XY problem here.

    Update

    I just realised that you are forbidding consecutive numbers in your if condition, i.e. (2,3,9) is never possible.

    This will change the math, but the approach is the same.

    Roboticus said you need 15 chars on average. 7.4 bytes per line is just an upper bound, so 50% is easily reached.

    Don't wanna calculate it again! This would need to be done programmatically.

    (But I don't trust your code anyway ;)

    Cheers Rolf

      Instead of calculating I wrote a little script counting the probabilities of group combinations.

      Roboticus was right, you need about 14.06 characters per line plus "\n".

      Only 1 + 8 + 21 + 20 = 50 combinations are possible per group, resulting in a conservative compression of 50.79 bits = 6.35 bytes per line, which already means a compression to 42%, i.e. a 58% saving.

      But if you look into the likelihood of those combinations you clearly see that a Huffman encoding would result in an even better ratio.

      I think that's then near the theoretical optimum. (Further reading: Huffman_coding_with_unequal_letter_costs)

      All this supposing your input was real... ;-)

      use strict;
      use warnings;
      use Data::Dump qw/pp/;

      my %count;

      # Enumerate all 10**4 four-digit draws and count the resulting groups.
      for my $c0 (0..9){
        for my $c1 (0..9){
          for my $c2 (0..9){
            for my $c3 (0..9) {
              my @c = sort {$a <=>$b} ($c0,$c1,$c2,$c3);
              #print "@c\t:\t";
              my @allowed;
              # Keep a digit only if it is neither equal to nor the
              # direct successor of its sorted predecessor.
              for my $i (1..3) {
                if ( $c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1 ) {
                  push @allowed, $c[$i]
                }
              }
              #print "@allowed\n";
              $count{join "",@allowed}++
            }
          }
        }
      }

      # Number of distinct groups per length and average group length.
      my @length;
      my $average;
      for my $k (keys %count){
        my $len = length $k;
        $length[$len]++;
        $average+= $len* $count{$k}/10000;
      }

      warn '@length: ', pp \@length;
      warn 'average #characters: group/line',pp [$average,$average*9];

      my $combies =0;
      $combies+= $length[$_] for 0..3;
      #$combies=93;
      warn "# possible combinations: ", $combies;

      # Fixed-size code: log2(combinations) bits per group, 9 groups per line.
      my $upper_bound= log($combies)/log(2)*9;
      warn 'Upper bound bits, bytes', pp [$upper_bound, $upper_bound/8];

      #warn "ranking", pp [ sort {$b <=>$a} values %count ];
      warn 'probabilities: ',pp \%count;

      @length: [1, 8, 21, 20] at /tmp/compress.pl line 36.
      average #characters: group/line[1.5624, 14.0616] at /tmp/compress.pl line 37.
      # possible combinations: 50 at /tmp/compress.pl line 44.
      Upper bound bits, bytes[50.7947057079725, 6.34933821349656] at /tmp/compress.pl line 48.
      probabilities: {
        ""    => 592,
        "2"   => 74,
        "24"  => 60,
        "246" => 24,
        "247" => 24,
        "248" => 24,
        "249" => 24,
        "25"  => 84,
        "257" => 24,
        "258" => 24,
        "259" => 24,
        "26"  => 84,
        "268" => 24,
        "269" => 24,
        "27"  => 84,
        "279" => 24,
        "28"  => 84,
        "29"  => 60,
        "3"   => 208,
        "35"  => 144,
        "357" => 48,
        "358" => 48,
        "359" => 48,
        "36"  => 192,
        "368" => 48,
        "369" => 48,
        "37"  => 192,
        "379" => 48,
        "38"  => 192,
        "39"  => 144,
        "4"   => 366,
        "46"  => 228,
        "468" => 72,
        "469" => 72,
        "47"  => 300,
        "479" => 72,
        "48"  => 300,
        "49"  => 228,
        "5"   => 524,
        "57"  => 312,
        "579" => 96,
        "58"  => 408,
        "59"  => 312,
        "6"   => 682,
        "68"  => 396,
        "69"  => 396,
        "7"   => 840,
        "79"  => 336,
        "8"   => 830,
        "9"   => 508,
      } at /tmp/compress.pl line 52.
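To put a number on the Huffman remark, here is a rough sketch that rebuilds the same group counts and computes Huffman code lengths from them. It assumes that groups are independent and follow exactly the distribution printed above, which only holds if the input really is generated that way. The merge loop is a plain sort-based Huffman construction written for this illustration, not code from the thread.

```perl
use strict;
use warnings;

# Rebuild the group statistics the same way the script above does:
# all four-digit draws, sorted, keeping only digits that are neither
# equal to nor the direct successor of their predecessor.
my %count;
for my $n (0 .. 9999) {
    my @c = sort { $a <=> $b } split //, sprintf '%04d', $n;
    my @allowed = map { $c[$_] } grep {
        $c[$_] != $c[$_-1] && $c[$_] != $c[$_-1] + 1
    } 1 .. 3;
    $count{ join '', @allowed }++;
}

# Huffman code lengths: repeatedly merge the two lightest nodes;
# every symbol inside a merged node moves one level deeper.
my %bits  = map { $_ => 0 } keys %count;
my @nodes = map { [ $count{$_}, [$_] ] } keys %count;
while (@nodes > 1) {
    @nodes = sort { $a->[0] <=> $b->[0] } @nodes;
    my ($x, $y) = splice @nodes, 0, 2;
    $bits{$_}++ for @{ $x->[1] }, @{ $y->[1] };
    push @nodes, [ $x->[0] + $y->[0], [ @{ $x->[1] }, @{ $y->[1] } ] ];
}

my $total = 0;
$total += $_ for values %count;        # 10**4 four-digit draws

# Expected code length and entropy, both in bits per group.
my ($huffman, $entropy) = (0, 0);
for my $k (keys %count) {
    my $p = $count{$k} / $total;
    $huffman += $p * $bits{$k};
    $entropy -= $p * log($p) / log(2);
}

printf "entropy: %.3f bits/group, %.2f bits/line\n", $entropy, 9 * $entropy;
printf "Huffman: %.3f bits/group, %.2f bits/line\n", $huffman, 9 * $huffman;
```

The entropy line is the lower bound for any per-group code under these assumptions, so comparing the two lines shows how close a per-group Huffman code gets to it and how far both sit below the fixed-size 50.79 bits per line.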

      Cheers Rolf

      update

      I just realized that Roboticus already had the same basic ideas here: Re: Data compression by 50% + : is it possible?
