Fingerprinting text documents for approximate comparison

by Mur (Pilgrim)
on Mar 24, 2005 at 16:21 UTC [id://442095]

Mur has asked for the wisdom of the Perl Monks concerning the following question:

I need a way to calculate a "fingerprint" on a modest-sized text document that will allow matching this document to ones which are very similar (differing by formatting, perhaps a few small insertions, etc.). Specifically, I want to match up text-rendered web pages (we run the HTML through a text-mode browser like "lynx" and dump the output) which may be "nearly" identical (for example, news stories at multiple sites might be the same story from a wire service, with some minor formatting changes depending on whether the story is on one newspaper's website or another).

What we have tried so far is to do things like drop all small words (<= 5 characters), reduce to lower-case, and remove all whitespace, then checksum. That's fairly accurate, but it doesn't give me a "nearness" number. I'd like some reliable way of pointing at two different 10K text strings, and saying "string A is 99% likely to be the same article as string B". I've thought about using String::Approx for this, too. Haven't actually worked it up into a solution yet.
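
Roughly, what we do now amounts to something like this sketch (not our production code; Digest::MD5 here just stands in for whatever checksum is used, and the helper name is made up):

    use Digest::MD5 qw(md5_hex);

    # Sketch: reduce a rendered page to a crude canonical form, then checksum it.
    sub rough_fingerprint {
        my ($text) = @_;
        my @words = grep { length $_ > 5 }      # drop all small words (<= 5 chars)
                    split /\W+/, lc $text;      # lower-case, ignore punctuation/whitespace
        return md5_hex( join '', @words );      # identical only if the kept words match exactly
    }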

--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
vox 269.226.9550 ext 24
fax 269.349.9076
 http://www.nexcerpt.com
...Nexcerpt...Connecting People With Expertise

Replies are listed 'Best First'.
Re: Fingerprinting text documents for approximate comparison
by ww (Archbishop) on Mar 24, 2005 at 17:38 UTC
    ++! Fascinating project, so please tolerate (or skip) this un-perlish reply (ie, algorithmic only) and request for comments:

    For articles in which a newspaper's local editors or reporters insert a paragraph or so of text into an otherwise unchanged piece of wire copy, the fingerprint matching will FAIL (which may be what's desired, if failed matches are, for example, flagged for human attention or for more sophisticated processing).

    Further study of failed matches can be a waste -- if the local insert merely "localizes" (in the newsroom sense of that term) the story without adding anything significant, as, for example, "Sheriff Numnutz of our_own_county said he had instituted an even better program last duration_entity...."

    Update - OTOH, such study could be highly valuable if the local insert to a gushing, bloviating wire service story on some hyper-hyped tech story were to say "...person_in_main_story was convicted of releasing buggy software in local_court last duration_entity."

    Also a "waste" would be instances where a story gets flagged as non-matching because the jump line ("Continued on page 24" or "see 'Numnutz reacts'" etc) occurs at a different location in the story (ie, paper A gave it 5 para their webpage before linking to "full text" while paper B gave it only 3.

    So the "nearness" test has a lot to commend it... but it seems to me that the plan by which you determine WHICH SITES to sample and the vagaries of their layouts is going to overwhelm the "nearness" test -- at least in a good many cases).

    And, pursuing the good Monks' suggestions above, the CPAN synopsis re Digest::Nilsimsa says, in part, "A nilsimsa signature is a statistic of n-gram occurance in a piece of text. It is a 256 bit value usually represented in hex." Does that 256 bit limit reduce its effectiveness as the length of the text increases?

    updated definition: "An N-gram is a sequence of length N. For example, sequences of words within texts are N-grams, but the idea is much more general."
    ...and

    "An N-gram is like a moving window over a text, where N is the number of words in the window.
    Bigrams: two consecutive words
    Trigrams: three consecutive words
    Quadrigrams: four consecutive words
    And so on.
    N-gram analysis calculates the probability of a given sequence of words occurring: if you know what the first word of a pair is, how confidently can you predict the next one, using the conditional bigram probability P(w_n | w_{n-1})."

    Possibly supporting my naive hypothesis that the worth of the nilsimsa value goes down as the length of the source text increases is this, from a site referenced in Digest::Nilsimsa:
    "The nilsimsa of these two codes is 92 on a scale of -128 to +128. That means that 36 bits are different and 220 bits the same. Any nilsimsa over 24 (which is 3 sigma) indicates that the two messages are probably not independently generated." I say "possibly" because (arguing against my own previous point) comparing two large and similar texts, one of which is distinguished only by a relatively small insert (and possible formatting changes, which the OP is seeking to exclude from the testing), would seem likely to yield a nilsimsa of well over 24.

      I downloaded Digest::Nilsimsa and started to toy around with it. What I don't understand, and the referenced site does not explain, is how to get from two 256-bit values to the scale of -128 to +128. Anyone have any thoughts? (I emailed the module author, but since this code is a couple of years old I have no expectation that I will reach him.)
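
      My best guess so far, working from the example quoted on that page (220 bits the same giving a score of 92, i.e. 220 - 128), is that the comparison is simply the count of matching bits minus 128. A sketch of that arithmetic -- this is only my assumption, not anything taken from the module's documentation:

      # Guess: score = (number of matching bits) - 128, giving a range of -128 .. +128.
      # 220 matching bits would give 220 - 128 = 92, matching the quoted example.
      sub nilsimsa_score {
          my ($hex_a, $hex_b) = @_;                    # two 64-character hex digests
          my $xor       = pack( 'H*', $hex_a ) ^ pack( 'H*', $hex_b );
          my $diff_bits = unpack( '%32b*', $xor );     # count of differing bits (0..256)
          return ( 256 - $diff_bits ) - 128;
      }
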
      --
      Jeff Boes
        Jeff
        back to a doc I don't fully grok... but...

        The scale -128 to +128 appears to be arbitrary and could just as well be 0-255.

        That inference comes from the author's statement re his example output from two slightly variant sources: "The nilsimsa of these two codes is 92 on a scale of -128 to +128. That means that 36 bits are different and 220 bits the same. Any nilsimsa over 24 (which is 3 sigma) indicates that the two messages are probably not independently generated."

        BTW, I attach high value to the observations (below) from BrowserUK and sfink, but I am not sure I'm ready to buy (no offense intended, BrowserUK!) BUK's "no easy way" as (1) gospel, nor (2, and more important) as any reason not to search for a way.

Re: Fingerprinting text documents for approximate comparison
by Eimi Metamorphoumai (Deacon) on Mar 24, 2005 at 16:30 UTC
    There's Digest::Nilsimsa, which might be worth looking into. I haven't tried it, but it sounds like it's designed to do what you're looking for.
Re: Fingerprinting text documents for approximate comparison
by polettix (Vicar) on Mar 24, 2005 at 16:53 UTC
    To have numbers representing distances between sentences you can go with Text::PhraseDistance, but you'll have to peruse the documentation to figure out how to get a percentage from these numbers (maybe a simple division by the number of words could suffice).

    Flavio

    Don't fool yourself.
Re: Fingerprinting text documents for approximate comparison
by perlfan (Vicar) on Mar 24, 2005 at 17:17 UTC
    I did a quick search, and came up with String::Approx, but its documentation states that:

    NOTE: String::Approx has been designed to work with strings, not with text. In other words, when you want to compare things like text or source code, consisting of words or tokens and phrases and sentences, or expressions and statements, you should probably use some other tool than String::Approx, like for example the standard UNIX diff(1) tool, or the Algorithm::Diff module from CPAN, or if you just want the Levenshtein edit distance (explained below), the Text::Levenshtein module from CPAN. See also Text::WagnerFischer and Text::PhraseDistance.

    So that might give you some other ideas.
      Hmm. Again, I want an "absolute" fingerprint (like a checksum), rather than a way to compare two given documents. LevenshteinXS ran for over two minutes comparing two documents.
      --
      Jeff Boes
Re: Fingerprinting text documents for approximate comparison
by FitTrend (Pilgrim) on Mar 24, 2005 at 16:26 UTC

    A wacky idea might be to dump these changed files into a Subversion repository (a source control system) using its command-line tools. You could then use Perl to extract the files and perform diffs on them to see what changes have been made (however small they are).

    This may minimize the amount of code you need to manage, by relying on the capabilities of that system.

    Alternatively (and possibly more fun), there are modules on CPAN that perform diffs on files. What comes to mind is Text::Diff.
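
    For example, Text::Diff can diff two in-memory strings directly (a minimal sketch; the sample text is just for illustration):

    use Text::Diff;

    # Sketch: produce a unified diff of two already-rendered pages held in scalars.
    my $page_a = "Sheriff Numnutz said the program started last year.\n";
    my $page_b = "Sheriff Numnutz said the program began last year.\n";

    my $diff = diff( \$page_a, \$page_b, { STYLE => 'Unified' } );
    print length($diff) ? $diff : "No differences found\n";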

      Ack. No, not in the scope of what I'm talking about. I have thousands of these per day, and I don't want to compare every one to every other one.
      --
      Jeff Boes
Re: Fingerprinting text documents for approximate comparison
by halley (Prior) on Mar 24, 2005 at 19:41 UTC
    The crux of your problem is you want to compare a numerical distance from page A to page B without a lot of processing.

    Nilsimsa seems useless if you can't compare a numerical distance.

    Levenshtein seems useless if you can only compare relative distances by processing each pair of pages.

    When I hear "distance" I think Pythagoras.

    Some folks mentioned making a massive bit vector for nontrivial unique words, but having done something like this before, you quickly end up with bit vectors of thousands of words. And the number of times each word appears in a page may be important, so then you end up with count-vectors of thousands of words.

    You're on the right track. My method is closer to the way DNA gets compared, in comparing the prevalence of smaller and smaller substring fragments. You only need a small vector (e.g., 26 ints) for each page's "absolute" position in the global space. You then can compare two absolute positions in O(1) time.

    # for each page,
    #     romanize anything you can romanize
    #     remove subjective words if possible
    #     canonicalize the page, strip formatting
    #     remove all low-value words
    #     dissolve the page into letters
    #     form an absolute 26-space vector from remaining letters
    #     do NOT normalize the 26-space vector
    #     save the 26-space vector for this page

    Then, the "distance" or "difference" between two pages is just a simple comparison of two 26-space vectors. You may scale the importance of some letters higher than others, and you may choose or base your function on Manhattan, Euclidean or Pythagorean methods of measuring the distance.

    Don't normalize the 26-space vector, or you'll find that any non-trivial page matches an encyclopedia in terms of overall letter distribution.

    You can also do this on a chain approach: each natural section of a page (sentence, paragraph, title-to-title, etc.) can be fingerprinted separately, depending on the granularity you can afford to store and search. You can then search for fragments as well as whole pages. For example, one paragraph may be plagiarized from a different source.

    Naive searches are O(n) with n fingerprints, if you count the distance formula as O(1). I would think you could b-tree or octree or duodecasexigesima-tree (?) the vector search space for faster exclusions of very different fingerprints, and reduce that search pretty far.
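
    A minimal sketch of the idea (raw letter counts and a plain Euclidean distance; the per-letter weighting and the tree-structured search are left out, and the sub names are made up):

    # Sketch: 26-slot letter-count vector per page, compared with Euclidean distance.
    sub letter_vector {
        my ($text) = @_;
        ( my $letters = lc $text ) =~ tr/a-z//cd;    # keep letters only
        my @v = (0) x 26;
        $v[ ord($_) - ord('a') ]++ for split //, $letters;
        return \@v;
    }

    sub letter_distance {
        my ($p, $q) = @_;
        my $sum = 0;
        $sum += ( $p->[$_] - $q->[$_] ) ** 2 for 0 .. 25;
        return sqrt $sum;
    }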

    --
    [ e d @ h a l l e y . c c ]

Re: Fingerprinting text documents for approximate comparison
by ambrus (Abbot) on Mar 24, 2005 at 17:30 UTC

    My guess would be to extract all words from the web page and create a bit vector where you set a bit for each word with some hashing algorithm. Select the size of the bit vector in such a way that the ratio of bits set would not be close to 1. If the bit vector would be too large this way, store only a certain sized slice of it (thus including only those words whose hash value is in a certain interval). You can then count the bits that are in only one of the bit vectors to get the approximate distance of the texts.

    See also secondary indexing in Knuth volume 3, which surely teaches you much more about this topic.
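
    A rough sketch of the bit-vector idea (the vector size and the choice of MD5 as the word hash are arbitrary assumptions, not anything prescribed above):

    use Digest::MD5 qw(md5);

    my $BITS = 4096;    # pick a size that keeps the ratio of set bits well below 1

    # Sketch: set one bit per distinct word, chosen by a hash of the word.
    sub word_bitvec {
        my ($text) = @_;
        my $vec = "\0" x ( $BITS / 8 );
        for my $word ( grep { length } split /\W+/, lc $text ) {
            vec( $vec, unpack( 'N', md5($word) ) % $BITS, 1 ) = 1;
        }
        return $vec;
    }

    # Approximate distance: bits set in exactly one of the two vectors.
    sub bit_distance {
        my ($va, $vb) = @_;
        return unpack( '%32b*', $va ^ $vb );
    }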

Re: Fingerprinting text documents for approximate comparison
by planetscape (Chancellor) on Mar 27, 2005 at 10:03 UTC

    The techniques of Text Mining sound applicable to your quandary. When I read your question, I thought immediately of an article I'd read:

    Marc Damashek. Gauging Similarity with N-Grams: Language-Independent Categorization of Text. Science, Vol. 267, pp. 843-848, 10 February 1995.

    Similar articles include:

    Searching for text? Send an N-Gram! (includes related articles on implementing n-gram systems and n-gram vectors in C) (technical) Roy E. Kimbrell. Byte, May 1988 v13 n5 p297(9).

    Abstract: N-gram indexing systems are the best method of retrieving information from large full-text databases. An n-gram is a sequence of a specified number of characters occurring in a word. N-gram vectors must be derived for each document stored in order to set up a document-retrieval system using n-grams. N-gram indexing is computationally less intensive than keyword solutions, the next best alternative. N-gram systems are adaptable to several different situations and systems do not need to be re-indexed to answer completely new questions. N-grams are limited in that they are complicated, is memory- and processor-intensive, and is not exact {sic}.

    and:

    "One Size Fits All? A Simple Technique to Perform Several NLP Tasks." by Daniel Gayo-Avello, Darío Álvarez-Gutiérrez, and José Gayo-Avello.

    There are several Perl packages for working with N-Grams; you can search CPAN for them.

    I realize this is only a possible pointer in the right direction, but hope it helps.

    Nancy

    Addendum: This might be the best answer to your question: http://www.perlmonks.org/index.pl?node_id=32285

Re: Fingerprinting text documents for approximate comparison
by BrowserUk (Patriarch) on Mar 24, 2005 at 18:55 UTC

    To do this with any accuracy you need a pre-existing corpus of "typical data".

    The best signature of a given piece of text is the N rarest words it contains, where 'rarest' is defined in terms of the frequency with which each word appears in your corpus of typical data.

    However, there is no easy way to convert that to a single numerical value that will allow your "likelihood" approximation. Even if you took two identical pieces of text that each carried the addition of, say, the transmitting bodies--eg. 'Reuters' and 'CNN'--then those additions would likely affect any reduction to a numerical value in a way that would make comparison very hard.
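
    For what it's worth, extracting such a signature is straightforward once the corpus frequencies exist (a sketch only; %corpus_freq is an assumed, precomputed word => count table and the sub name is made up):

    # Sketch: the N corpus-rarest distinct words of a document, as its signature.
    sub rarest_words {
        my ( $text, $n, $corpus_freq ) = @_;
        my %seen;
        my @words  = grep { length && !$seen{$_}++ } split /\W+/, lc $text;
        my @sorted = sort { ( $corpus_freq->{$a} || 0 ) <=> ( $corpus_freq->{$b} || 0 ) } @words;
        $n = @sorted if $n > @sorted;
        return @sorted[ 0 .. $n - 1 ];
    }
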


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
Re: Fingerprinting text documents for approximate comparison
by johndageek (Hermit) on Mar 24, 2005 at 19:06 UTC
    I would look at creating a fingerprint file for each document (you will need to refine the parameters you use).

    In this file I would put perhaps:
    number of significant words
    average number of letters of the top 5 most common words
    The three least common significant words (alphabetized)
    The three most common significant words (alphabetic)

    You can either use your current checksum, or create a checksum on the fingerprint files.

    Use similar checksums to select fingerprint files to compare; those fingerprints that are within a tolerance you set would be deemed matches.
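
    A sketch of building the fingerprint record described above ("significant" is assumed here to mean longer than five characters, and the document is assumed to contain at least five such words):

    sub fingerprint_record {
        my ($text) = @_;
        my %count;
        $count{$_}++ for grep { length $_ > 5 } split /\W+/, lc $text;

        my @by_freq = sort { $count{$b} <=> $count{$a} } keys %count;
        my $len_sum = 0;
        $len_sum += length $_ for @by_freq[ 0 .. 4 ];

        return {
            significant_words => scalar keys %count,
            avg_top5_length   => $len_sum / 5,
            most_common       => [ sort @by_freq[ 0 .. 2 ] ],
            least_common      => [ sort @by_freq[ -3 .. -1 ] ],
        };
    }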

    Just my 2 cents worth, good luck!

    Enjoy!
    Dageek

Re: Fingerprinting text documents for approximate comparison
by sfink (Deacon) on Mar 24, 2005 at 19:03 UTC
    Keep the small words. If you want to drop stuff, then get a table of the N most common words in your target language (or corpus) and drop those. But you probably needn't bother.

    Pick some fairly short "chunk size". It'll need to be a little longer if you don't drop common words, but you'll probably just need to tune it experimentally.

    Hash every substring of that chunk size. That'll give you one hash per token in your input (approximately -- actually, a few less). If you use a rolling hash function, you can do this fast. For now, ignore the problem that this will be about four times larger than your original input.

    What you do next depends on whether you want to do this pairwise, or you want to find all of the most similar documents in a large corpus.

    Pairwise: hash the two documents and count up the percentage of hashes in common.

    Corpus: Make a giant hash table mapping each of these hashes to some id representing the document they came from. Hash the incoming document and look up each of the hashes in the table. Generate another table mapping document id => count of matching hashes. Sort this by count.

    This is still impractical because of the number of hashes. So randomly drop some proportion of them. Make the drop decision based only upon the value of the hash, so that you drop the same hashes from all the documents. (Instead of comparing entire elephants to elephants, you want to compare elephant legs and trunks to elephant legs and trunks. You do not want to compare legs to sometimes legs, sometimes tails.)
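
    A sketch of the chunk-hashing part (MD5 stands in here for a real rolling hash, and the chunk size and keep fraction are just placeholder values to tune; the sub names are made up):

    use Digest::MD5 qw(md5_hex);

    my $CHUNK = 5;       # tokens per chunk -- tune experimentally
    my $KEEP  = 0.25;    # fraction of hashes to keep, decided by hash value alone

    # Sketch: hash every $CHUNK-token window, then drop hashes purely by value
    # so the same hashes are dropped from every document.
    sub chunk_hashes {
        my ($text) = @_;
        my @tok = grep { length } split /\W+/, lc $text;
        my %kept;
        for my $i ( 0 .. $#tok - $CHUNK + 1 ) {
            my $h = md5_hex( join ' ', @tok[ $i .. $i + $CHUNK - 1 ] );
            $kept{$h} = 1 if hex( substr( $h, 0, 8 ) ) / 0xFFFFFFFF < $KEEP;
        }
        return \%kept;
    }

    # Pairwise: fraction of kept hashes in common.
    sub similarity {
        my ( $ha, $hb ) = @_;
        my $total  = keys %$ha;
        my $common = grep { $hb->{$_} } keys %$ha;
        return $total ? $common / $total : 0;
    }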

    There is a further refinement that eliminates degenerate cases, but it's patented and so you're probably better off not knowing about it. Perhaps you'll rediscover it on your own, but then you won't have this post sitting here to prove willful infringement.

    Search for "MOSS" and "agrep" for more algorithmic details. Useful names are "Alex Aiken" and "Udi Manber".

    Except that your name looks very familiar, so you probably already know all this. I think you may be the guy who asked me to call you a few months back. Sad how I'll happily reply to questions posted on this site, but never get around to replying to my personal email...

Re: Fingerprinting text documents for approximate comparison
by jhourcle (Prior) on Mar 24, 2005 at 23:55 UTC

    I don't have a direct solution, but I would think that similar algorithms to those used for spam fingerprinting might work. I think that Vipul's Razor uses this technique, as do Pyzor and Dcc.

    In looking through the Dcc website, it seems that the keyword you might want to try searching on is 'fuzzy matching'.

Re: Fingerprinting text documents for approximate comparison
by thekestrel (Friar) on Mar 24, 2005 at 21:07 UTC
    Hi,
    How about doing a diff on the two files and then dividing the size of the diff by the size of the file to give you a percentage of similarity? You might want to prune the diff output so it only gives the results from the comparison file; otherwise you're going to get a little over twice the expected size in the diff, as both sides are included plus some fluff from the diff itself.
    This is more relevant than just comparing the sizes of the files (which is obviously not a solution), in that it does give consideration to content.
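
    Something along these lines, say, using Text::Diff (a sketch; the ratio of diff length to original length is only a rough stand-in for "percentage of similarity"):

    use Text::Diff;

    # Sketch: crude dissimilarity estimate from the relative size of the diff.
    sub diff_fraction {
        my ( $text_a, $text_b ) = @_;
        my $diff = diff( \$text_a, \$text_b );
        return length($diff) / ( length($text_a) || 1 );   # 0 means identical
    }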

    Regards Paul
      Urk. Again, any approach that requires comparing two documents directly is going to eat me alive, as I have thousands of these every day. N-squared, y'know.

      To restate my desired outcome: I can checksum each document, and then get a zero-or-one answer by comparing checksums. But what I really want is a "fuzzy" checksum, kind of like taking a thumbnail of an image and comparing the thumbnails. That led me to the approach of throwing out all the short words, whitespace, punctuation, etc. and checksumming the resulting string.

      --
      Jeff Boes
        understood...
        What about compressing the document with something like Huffman encoding, which would shorten all of the words by replacing repeated instances with keys and so really compress the text? You could then compare the 'keys' it uses as replacements, as a way of comparing like text. Going further (though this might be pushing it), you could store just the keys it uses (i.e. the header from the compression), since these are assigned based on frequency of use; that would also let you eliminate all of the short words.
        Just a thought.. =)

        Regards Paul

        Update:
        You would still have to compare the 'signatures', which would still be very time consuming. The only way I can see around this with this method is to take the header that is generated, pick the top 10 most-used chunks (they can be words or phrases), and sort them alphabetically. Then store your document in directories and subdirectories based on each word, i.e.

        \stars\moon\rocks\
        has RocksoftheMoon.doc and GanymedeGeology.doc
        for instance....
        This way you evaluate your document when you get it and store it in a specific place, and like documents just end up in the same directory... or at least nearby.

        Regards Paul
Re: Fingerprinting text documents for approximate comparison
by graff (Chancellor) on Mar 25, 2005 at 10:13 UTC
    Here is a very simple-minded approach that is at least well motivated and is easy to implement and comprehend. Think of each letter as a "vector" that takes you some distance in a particular direction in a 26-dimension space; the more often you encounter a particular letter, the farther you go in that direction. If two documents contain the exact same set of letters, you'll end up at the same place after traversing each one. (If you like, you can keep case distinctions and/or include digits and even punctuation, to add more dimensions.)

    Most people who do this sort of thing do it in terms of some sort of cosine or other trig-like metric, but for a bone-head approach that I'm making up from scratch, that's a bit beyond my ken (I never got that far in math -- or if I did, I've forgotten). So instead, I'll just assign a distinct "distance" (a distinct power of 2) to each letter, so that if two documents contain the same set of letters, I'll traverse the same distance.

    Because it seems sensible, I'll assign distances to the letters in order of their likelihood of occurrence. Then, for each document, I just build up a character histogram, and sum up a distance score by multiplying each character's frequency of occurrence by its assigned magnitude.

    To make things more efficient, I'll prepare the input data by building a lookup table for the docs: a simple list that gives a file name, a doc-ID string, and a starting and ending byte offset in the file. (Whether a given file contains one doc or many is not relevant -- I just need to make sure that each doc has a unique ID.)

    #!/usr/bin/perl
    use strict;

    die "Usage: $0 < doctable > sig.list\n" if ( @ARGV );

    # a simple-minded attempt to generate a "signature" value
    # that could be used as a means of numerically measuring
    # document similarity

    # @letters is ordered according to relative letter frequency in
    # English news text (based on a 1-month sample of AP newswire):
    my @letters = qw/e a t i n o s r h l d c u m p g f w y b v k j x z q/;

    # %vectors stores a distinct power of 2 for each letter;
    # more frequent letters are assigned lower powers:
    my $i;
    my %vectors = map { $_ => 1<<$i++ } @letters;

    my ($openfn,$offset);

    # Input on stdin is one doc entry per line:
    #   file.name  doc_id  begin_offset  end_offset
    while (<>) {
        my ( $fn, $id, $so, $eo ) = split;
        if ( $fn ne $openfn ) {
            open IN, $fn or die "$fn: $!";
            $openfn = $fn;
            $offset = 0;
        }
        seek( IN, $so, 0 ) if ( $so != $offset );
        read( IN, $_, $eo-$so );
        $offset = $eo;
        tr/A-Z/a-z/;
        tr/a-z//cd;
        my %c_hist = ();
        $c_hist{$_}++ for ( split // );
        my $sig = 0;
        $sig += $c_hist{$_} * $vectors{$_} for ( keys %c_hist );
        printf "%s\t%f\n", $id, $sig;
    }

    The output will be such that shorter docs will have smaller signature numbers and longer docs will have larger numbers; docs with equal numbers have the exact same inventory of letters, and ones with small differences in their scores will have very few letters different.

    (Well, there is a possibility that two completely unrelated docs might be anagrams of one another. And of course, a document with 2 "e"s will score the same as one with a single "a", other things being equal, and likewise for other equivalences. But for a first-pass discriminator, it works pretty well. I also tried it with prime numbers instead of powers of 2, and found that the primes yielded more "collisions" -- fewer distinct values on the same set of about 25K documents.)

    Another thing to point out is that this would be easy enough to implement in C to be worthwhile for the run-time you'd save.

    But that's the simple part. The hard part, especially if you're handling docs that are being pulled off of web sites, is to make sure that you only compare the parts that matter.

    I presume you are already taking care to strip out markup, HTML comments, javascript, yadda yadda. But once you reduce the pages down to visible text content, you still need to strip away or simply avoid all the site-specific framework (page headers, footers, side bars, nav. bars, ad nauseam). If you're treating pages from multiple web sites, you need filters adapted to each one (and each filter is likely to have to change over time, as the web sites change their formats at whim). Good luck with that.

Re: Fingerprinting text documents for approximate comparison
by gam3 (Curate) on Mar 25, 2005 at 01:33 UTC
    Here is a very simplistic way you might try.
    #!/usr/bin/perl
    use strict;

    use Digest::MD5 qw(md5 md5_hex md5_base64);

    our $stoplist = {
        and => 0,
        or  => 0,
        but => 0,
    };

    sub main {
        my $line;
        my $words = {};
        while ($line = shift) {
            chomp($line);
            my @data = split(/\b/, $line);
            for my $word (@data) {
                $word =~ s/\s*//g;
                next if length $word < 2;
                $words->{lc($word)}++;
            }
        }
        my @out;
        for my $key (keys %$words) {
            next if $words->{$key} < 2;
            if (defined $stoplist->{$key}) {
                next if $stoplist->{$key} == 0;
                next if $stoplist->{$key} > $words->{$key};
            }
            push @out, $key;
        }
        print join('', sort @out), "\n";
        print md5_base64(join(' ', sort @out)), "\n";
    }

    main(<<EOP);
    The "Digest::MD5" module allows you to use the RSA Data Security Inc. MD5
    Message Digest algorithm from within Perl programs. The algorithm takes as
    input a message of arbitrary length and produces as output a 128-bit
    "fingerprint" or "message digest" of the input.
    EOP

    main(<<EOP);
    The "digest::MD5" Module allows you to use the RSA Data Security Inc. MD5
    Message Digest algorithm from within Perl programs. The algorithm takes as
    input a message of arbitrary length and produces as output a 128-bit
    "fingerprint" and "message digest" of the input.
    EOP
    The output:
    algorithmasdigestinputmd5messageofthe
    PyVzoLxidA4SklaM0RsrhQ
    algorithmasdigestinputmd5messageofthe
    PyVzoLxidA4SklaM0RsrhQ
    
    But looking at spam code is probably the way to go.

    Update: While the idea was sound, the code did not run correctly.

    -- gam3
    A picture is worth a thousand words, but takes 200K.

      That isn't going to be useful. The MD5 algorithm is expressly designed to detect differences, not similarity:

      use Digest::MD5 qw[md5_hex];

      my $s = 'the quick brown fox jumps over the lazy dog';

      print md5_hex $s;
      77add1d5f41223d5582fca736a5cb335

      print md5_hex $s . 's';
      5e48a737eaff799917707b2815af10fc

      print md5_hex $s . 'S';
      d02763729a741eed14417a1051ec228c

      Even the addition of a single character, or changing a single bit produces a (numerically) completely unrelated digest--exactly as it should for the purposes for which md5 is designed, but completely wrong for this application.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?
        The MD5 is only turning a list of words into a number. It is the list of words that is the fingerprint of the file. You could just compare the words. The MD5 is just being used as a checksum.
        -- gam3
        A picture is worth a thousand words, but takes 200K.
