Instead of calculating I wrote a little script counting the probabilities of group combinations.
Roboticus was right, you need about 14,05 characters per line plus "\n"
Only 1 + 8 + 21 + 20 = 50 combinations are possible per group, resulting in a conservative compression of 50.79 bits = 6.34 bytes per line, which already means a compression to 42% = 58% win.
But if you look into the likelihood of those combinations you clearly see that a Huffman encoding would result in an even better ratio.
I think that's then near the theoretical optimum. (further reading Huffman_coding_with_unequal_letter_costs )
All this supposing your input was real... ;-)
use strict;
use warnings;
use Data::Dump qw/pp/;
my %count;
for my $c0 (0..9){
for my $c1 (0..9){
for my $c2 (0..9){
for my $c3 (0..9) {
my @c = sort {$a <=>$b} ($c0,$c1,$c2,$c3);
#print "@c\t:\t";
my @allowed;
for my $i (1..3) {
if ( $c[$i] != $c[$i-1] && $c[$i] != $c[$i-1]+1 )
+{
push @allowed, $c[$i]
}
}
#print "@allowed\n";
$count{join "",@allowed}++
}
}
}
}
my @length;
my $average;
for my $k (keys %count){
my $len = length $k;
$length[$len]++;
$average+= $len* $count{$k}/10000;
}
warn '@length: ', pp \@length;
warn 'average #characters: group/line',pp [$average,$average*9];
my $combies =0;
$combies+= $length[$_] for 0..3;
#$combies=93;
warn "# possible combinations: ", $combies;
my $upper_bound= log($combies)/log(2)*9;
warn 'Upper bound bits, bytes', pp [$upper_bound, $upper_bound/8];
#warn "ranking", pp [ sort {$b <=>$a} values %count ];
warn 'probabilities: ',pp \%count;
@length: [1, 8, 21, 20] at /tmp/compress.pl line 36.
average #characters: group/line[1.5624, 14.0616] at /tmp/compress.pl l
+ine 37.
# possible combinations: 50 at /tmp/compress.pl line 44.
Upper bound bits, bytes[50.7947057079725, 6.34933821349656] at /tmp/co
+mpress.pl line 48.
probabilities: {
"" => 592,
"2" => 74,
"24" => 60,
"246" => 24,
"247" => 24,
"248" => 24,
"249" => 24,
"25" => 84,
"257" => 24,
"258" => 24,
"259" => 24,
"26" => 84,
"268" => 24,
"269" => 24,
"27" => 84,
"279" => 24,
"28" => 84,
"29" => 60,
"3" => 208,
"35" => 144,
"357" => 48,
"358" => 48,
"359" => 48,
"36" => 192,
"368" => 48,
"369" => 48,
"37" => 192,
"379" => 48,
"38" => 192,
"39" => 144,
"4" => 366,
"46" => 228,
"468" => 72,
"469" => 72,
"47" => 300,
"479" => 72,
"48" => 300,
"49" => 228,
"5" => 524,
"57" => 312,
"579" => 96,
"58" => 408,
"59" => 312,
"6" => 682,
"68" => 396,
"69" => 396,
"7" => 840,
"79" => 336,
"8" => 830,
"9" => 508,
} at /tmp/compress.pl line 52.
update
I just realized that Roboticus already had the same basic ideas here: Re: Data compression by 50% + : is it possible? |