I've taken a look at Digest::Nilsimsa. It is a cool thing. However, I've found a problem.
use strict ;
use warnings ;
use Digest::Nilsimsa ;

my $nil = Digest::Nilsimsa->new ;

# compare 'aaa...ab' against 'aaa...aa' at total lengths 31 through 37
for my $d ( 30 .. 36 ) {
    my $this = $nil->text2digest( ( 'a' x $d ) . 'b' ) ;
    my $that = $nil->text2digest( ( 'a' x ( $d + 1 ) ) ) ;
    print nilcomp( $this , $that ) ;
}

# report how two hex digests differ, one hex digit at a time
sub nilcomp {
    my $diff  = 0 ;   # count of hex digits that differ
    my $diff2 = 0 ;   # summed numeric difference of those digits (printed as "bits different")
    my @this = split // , shift ;
    my @that = split // , shift ;
    for my $i ( 0 .. $#this ) {
        $diff++ if $this[$i] ne $that[$i] ;
        my $is = hex $this[$i] ;
        my $at = hex $that[$i] ;
        $diff2 += abs( $is - $at ) ;
    }
    return ( join "" , @this ) . qq(\n) .
        ( join "" , @that ) . qq(\n) .
        $diff . qq( characters different\n) .
        $diff2 . qq( bits different\n\n) ;
}
gives you:
000000000000900000010021000008105000080010000004000c400000000008
0000000000009000000000200000080040000000000000040008400000000000
8 characters different
25 bits different
000000000000900000010021000008105000080010000004000c400000000008
0000000000009000000000200000080040000000000000040008400000000000
8 characters different
25 bits different
000000000000900000010021000008105000080010000004000c400000000008
0000000000009000000000200000080040000000000000040008400000000000
8 characters different
25 bits different
000000000000900000010021000008105000080010000004000c400000000008
0000000000009000000000200000080040000000000000040008400000000000
8 characters different
25 bits different
000000000000900000010021000008105000080010000004000c400000000008
0000000000009000000000200000080040000000000000040008400000000000
8 characters different
25 bits different
0000000000009000000000200000080040000000000000040008400000000000
0000000000009000000000200000080040000000000000040008400000000000
0 characters different
0 bits different
0000000000009000000000200000080040000000000000040008400000000000
0000000000009000000000200000080040000000000000040008400000000000
0 characters different
0 bits different
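A side note on the figure labeled "bits different": nilcomp adds up the numeric difference between hex digits, so it is not a count of flipped bits. A minimal sketch of a bit-level comparison, assuming both digests are hex strings of equal length (the name hamming_hex is mine, not part of Digest::Nilsimsa):

```perl
use strict;
use warnings;

# Count the bits that differ between two equal-length hex digests.
# hamming_hex is a hypothetical helper, not part of Digest::Nilsimsa.
sub hamming_hex {
    my ( $x, $y ) = @_;
    die "digests differ in length\n" unless length($x) == length($y);
    # pack 'H*' turns the hex text into raw bytes, ^ XORs the two byte
    # strings, and unpack '%32b*' sums the set bits of the result.
    my $xor = pack( 'H*', $x ) ^ pack( 'H*', $y );
    return unpack( '%32b*', $xor );
}
```

For the first pair of digests above this returns 8 rather than 25: each of the 8 differing hex digits differs in exactly one bit.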
If all I were trying to use this on were 35-character data sets, that'd be cool, but I'm trying to run this on whole web pages. Even if I pull out all markup and whitespace, I'll still be in the headers by the time the 35th character rolls around. I love it in theory, but in practice, the data's too big for the module.
So, I could do this: $output =~ s[(.{35})][$nilsimsa->text2digest($1)]ge ; or something of the sort, but that seems ... goofy.
But it does point out that taking length $output and comparing it to the value from last time should indicate a small change, if the two are only a small number of characters apart.
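For what it's worth, the chunk-at-a-time idea could be sketched roughly like this. The names chunk_digests and $digest_cb are made up for illustration; the callback stands in for whatever produces a digest, so nothing here depends on the module's internals. Unlike the s[(.{35})][...]ge version, it also digests the trailing partial chunk and lets . match newlines:

```perl
use strict;
use warnings;

# Split $text into chunks of at most $size characters and digest each one.
# $digest_cb is a hypothetical callback mapping a string to its digest.
sub chunk_digests {
    my ( $text, $size, $digest_cb ) = @_;
    # /s lets . match newlines; {1,$size} keeps the short final chunk
    my @chunks = $text =~ /(.{1,$size})/gs;
    return map { $digest_cb->($_) } @chunks;
}
```

With the module loaded, the call would look something like chunk_digests( $output , 35 , sub { $nilsimsa->text2digest( shift ) } ). One caveat: an insertion near the top of the page shifts every later chunk boundary, so most chunk digests change even for a small edit.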