Fingerprinting text documents for approximate comparison

by Mur (Pilgrim)
on Mar 24, 2005 at 16:21 UTC [id://442095]

Mur has asked for the wisdom of the Perl Monks concerning the following question:

I need a way to calculate a "fingerprint" on a modest-sized text document that will allow matching this document to ones which are very similar (differing by formatting, perhaps a few small insertions, etc.). Specifically, I want to match up text-rendered web pages (we run the HTML through a text-mode browser like "lynx" and dump the output) which may be "nearly" identical (for example, news stories at multiple sites might be the same story from a wire service, with some minor formatting changes depending on whether the story is on one newspaper's website or another).

What we have tried so far is to do things like drop all small words (<= 5 characters), reduce to lower-case, and remove all whitespace, then checksum. That's fairly accurate, but it doesn't give me a "nearness" number. I'd like some reliable way of pointing at two different 10K text strings, and saying "string A is 99% likely to be the same article as string B". I've thought about using String::Approx for this, too. Haven't actually worked it up into a solution yet.
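
Roughly, what we do now amounts to something like this sketch (not our production code; Digest::MD5 here just stands in for whatever checksum is used, and the helper name is made up):

    use Digest::MD5 qw(md5_hex);

    # Sketch: reduce a rendered page to a crude canonical form, then checksum it.
    sub rough_fingerprint {
        my ($text) = @_;
        my @words = grep { length $_ > 5 }      # drop all small words (<= 5 chars)
                    split /\W+/, lc $text;      # lower-case, ignore punctuation/whitespace
        return md5_hex( join '', @words );      # identical only if the kept words match exactly
    }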

--
Jeff Boes
Database Engineer
Nexcerpt, Inc.
vox 269.226.9550 ext 24
fax 269.349.9076
 http://www.nexcerpt.com
...Nexcerpt...Connecting People With Expertise

Replies are listed 'Best First'.
Re: Fingerprinting text documents for approximate comparison
by ww (Archbishop) on Mar 24, 2005 at 17:38 UTC
    ++! Fascinating project, so please tolerate (or skip) this un-perlish reply (ie, algorithmic only) and request for comments:

    For articles in which a newspaper's local editors or reporters insert a paragraph or so of text into an otherwise unchanged piece of wire copy, the fingerprint matching will FAIL (which may be what's desired, if failed matches are, for example, flagged for human attention or for more sophisticated processing).

    Further study of failed matches can be a waste -- if the local insert merely "localizes" (in the newsroom sense of that term) the story without adding anything significant, as, for example, "Sheriff Numnutz of our_own_county said he had instituted an even better program last duration_entity...."

    Update - OTOH, such study could be highly valuable if the local insert to a gushing, bloviating wire service story on some hyper-hyped tech story were to say "...person_in_main_story was convicted of releasing buggy software in local_court last duration_entity."

    Also a "waste" would be instances where a story gets flagged as non-matching because the jump line ("Continued on page 24" or "see 'Numnutz reacts'" etc) occurs at a different location in the story (ie, paper A gave it 5 para their webpage before linking to "full text" while paper B gave it only 3.

    So the "nearness" test has a lot to commend it... but it seems to me that the plan by which you determine WHICH SITES to sample and the vagaries of their layouts is going to overwhelm the "nearness" test -- at least in a good many cases).

    And, pursuing the good Monks' suggestions above, the CPAN synopsis re Digest::Nilsimsa says, in part, "A nilsimsa signature is a statistic of n-gram occurance in a piece of text. It is a 256 bit value usually represented in hex." Does that 256 bit limit reduce its effectiveness as the length of the text increases?

    updated definition: "An N-gram is a sequence of length N. For example, sequences of words within texts are N-grams, but the idea is much more general."
    ...and

    "An N-gram is like a moving window over a text, where N is the number of words in the window.
    Bigrams: two consecutive words
    Trigrams: three consecutive words
    Quadrigrams: four consecutive words
    And so on.
    N-gram analysis calculates the probability of a given sequence of words occurring: if you know what the first word of a pair is, how confidently can you predict the next one, using the conditional bigram probability P(w_n | w_{n-1})."

    Possibly supporting my naive hypothesis that the worth of the nilsimsa value goes down as the length of the source text increases is this, from a site referenced in Digest::Nilsimsa:
    "The nilsimsa of these two codes is 92 on a scale of -128 to +128. That means that 36 bits are different and 220 bits the same. Any nilsimsa over 24 (which is 3 sigma) indicates that the two messages are probably not independently generated." I say "possibly" because (arguing against my own previous point) comparing two large and similar texts, one of which is distinguished only by a relatively small insert (and possible formatting changes, which the OP is seeking to exclude from the testing), would seem likely to yield a nilsimsa of well over 24.

      I downloaded Digest::Nilsimsa and started to toy around with it. What I don't understand, and the referenced site does not explain, is how to get from two 256-bit values to the scale of -128 to +128. Anyone have any thoughts? (I emailed the module author, but since this code is a couple of years old I have no expectation that I will reach him.)
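
      My best guess so far, working from the example quoted on that page (220 bits the same giving a score of 92, i.e. 220 - 128), is that the comparison is simply the count of matching bits minus 128. A sketch of that arithmetic -- this is only my assumption, not anything taken from the module's documentation:

      # Guess: score = (number of matching bits) - 128, giving a range of -128 .. +128.
      # 220 matching bits would give 220 - 128 = 92, matching the quoted example.
      sub nilsimsa_score {
          my ($hex_a, $hex_b) = @_;                    # two 64-character hex digests
          my $xor       = pack( 'H*', $hex_a ) ^ pack( 'H*', $hex_b );
          my $diff_bits = unpack( '%32b*', $xor );     # count of differing bits (0..256)
          return ( 256 - $diff_bits ) - 128;
      }
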
      --
      Jeff Boes
        Jeff
        back to a doc I don't fully grok... but...

        The scale -128 to +128 appears to be arbitrary and could just as well be 0-255.

        That inference comes from the author's statement re his example output from two slightly variant sources: "The nilsimsa of these two codes is 92 on a scale of -128 to +128. That means that 36 bits are different and 220 bits the same. Any nilsimsa over 24 (which is 3 sigma) indicates that the two messages are probably not independently generated."

        BTW, I attach high value to the observations (below) from BrowserUK and sfink, but I am not sure I'm ready to buy (no offense intended, BrowserUK!) BUK's "no easy way" as (1) gospel, nor (2, and more important) as any reason not to search for a way.

Re: Fingerprinting text documents for approximate comparison
by Eimi Metamorphoumai (Deacon) on Mar 24, 2005 at 16:30 UTC
    There's Digest::Nilsimsa, which might be worth looking into. I haven't tried it, but it sounds like it's designed to do what you're looking for.
Re: Fingerprinting text documents for approximate comparison
by polettix (Vicar) on Mar 24, 2005 at 16:53 UTC
    To have numbers representing distances between sentences you can go with Text::PhraseDistance, but you'll have to peruse the documentation to figure out how to get a percentage from these numbers (maybe a simple division by the number of words could suffice).

    Flavio

    Don't fool yourself.
Re: Fingerprinting text documents for approximate comparison
by perlfan (Vicar) on Mar 24, 2005 at 17:17 UTC
    I did a quick search, and came up with String::Approx, but its documentation states that:

    NOTE: String::Approx has been designed to work with strings, not with text. In other words, when you want to compare things like text or source code, consisting of words or tokens and phrases and sentences, or expressions and statements, you should probably use some other tool than String::Approx, like for example the standard UNIX diff(1) tool, or the Algorithm::Diff module from CPAN, or if you just want the Levenshtein edit distance (explained below), the Text::Levenshtein module from CPAN. See also Text::WagnerFischer and Text::PhraseDistance.

    So that might give you some other ideas.
      Hmm. Again, I want an "absolute" fingerprint (like a checksum), rather than a way to compare two given documents. LevenshteinXS ran for over two minutes comparing two documents.
      --
      Jeff Boes
Re: Fingerprinting text documents for approximate comparison
by FitTrend (Pilgrim) on Mar 24, 2005 at 16:26 UTC

    A wacky idea might be to dump these changed files into a Subversion repository (a source control system) using its command-line tools. You could then use Perl to extract the files and perform diffs on them to see what changes have been made (however small they are).

    This may minimize the amount of code you need to manage, by relying on the capabilities of that system.

    Alternatively (and possibly more fun), there are modules on CPAN that perform diffs on files. What comes to mind is Text::Diff.
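
    For example, Text::Diff can diff two in-memory strings directly (a minimal sketch; the sample text is just for illustration):

    use Text::Diff;

    # Sketch: produce a unified diff of two already-rendered pages held in scalars.
    my $page_a = "Sheriff Numnutz said the program started last year.\n";
    my $page_b = "Sheriff Numnutz said the program began last year.\n";

    my $diff = diff( \$page_a, \$page_b, { STYLE => 'Unified' } );
    print length($diff) ? $diff : "No differences found\n";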

      Ack. No, not in the scope of what I'm talking about. I have thousands of these per day, and I don't want to compare every one to every other one.
      --
      Jeff Boes
Re: Fingerprinting text documents for approximate comparison
by halley (Prior) on Mar 24, 2005 at 19:41 UTC
    The crux of your problem is you want to compare a numerical distance from page A to page B without a lot of processing.

    Nilsimsa seems useless if you can't compare a numerical distance.

    Levenshtein seems useless if you can only compare relative distances by processing each pair of pages.

    When I hear "distance" I think Pythagoras.

    Some folks mentioned making a massive bit vector for nontrivial unique words, but having done something like this before, you quickly end up with bit vectors of thousands of words. And the number of times each word appears in a page may be important, so then you end up with count-vectors of thousands of words.

    You're on the right track. My method is closer to the way DNA gets compared, in comparing the prevalence of smaller and smaller substring fragments. You only need a small vector (e.g., 26 ints) for each page's "absolute" position in the global space. You then can compare two absolute positions in O(1) time.

    # for each page,
    #     romanize anything you can romanize
    #     remove subjective words if possible
    #     canonicalize the page, strip formatting
    #     remove all low-value words
    #     dissolve the page into letters
    #     form an absolute 26-space vector from remaining letters
    #     do NOT normalize the 26-space vector
    #     save the 26-space vector for this page

    Then, the "distance" or "difference" between two pages is just a simple comparison of two 26-space vectors. You may scale the importance of some letters higher than others, and you may choose or base your function on Manhattan, Euclidean or Pythagorean methods of measuring the distance.

    Don't normalize the 26-space vector, or you'll find that any non-trivial page matches an encyclopedia in terms of overall letter distribution.

    You can also do this on a chain approach: each natural section of a page (sentence, paragraph, title-to-title, etc.) can be fingerprinted separately, depending on the granularity you can afford to store and search. You can then search for fragments as well as whole pages. For example, one paragraph may be plagiarized from a different source.

    Naive searches are O(n) with n fingerprints, if you count the distance formula as O(1). I would think you could b-tree or octree or duodecasexigesima-tree (?) the vector search space for faster exclusions of very different fingerprints, and reduce that search pretty far.
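
    A minimal sketch of the idea (raw letter counts and a plain Euclidean distance; the per-letter weighting and the tree-structured search are left out, and the sub names are made up):

    # Sketch: 26-slot letter-count vector per page, compared with Euclidean distance.
    sub letter_vector {
        my ($text) = @_;
        ( my $letters = lc $text ) =~ tr/a-z//cd;    # keep letters only
        my @v = (0) x 26;
        $v[ ord($_) - ord('a') ]++ for split //, $letters;
        return \@v;
    }

    sub letter_distance {
        my ($p, $q) = @_;
        my $sum = 0;
        $sum += ( $p->[$_] - $q->[$_] ) ** 2 for 0 .. 25;
        return sqrt $sum;
    }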

    --
    [ e d @ h a l l e y . c c ]

Re: Fingerprinting text documents for approximate comparison
by ambrus (Abbot) on Mar 24, 2005 at 17:30 UTC

    My guess would be to extract all words from the web page and create a bit vector where you set a bit for each word with some hashing algorithm. Select the size of the bit vector in such a way that the ratio of bits set would not be close to 1. If the bit vector would be too large this way, store only a certain sized slice of it (thus including only those words whose hash value is in a certain interval). You can then count the bits that are in only one of the bit vectors to get the approximate distance of the texts.

    See also secondary indexing in Knuth volume 3, which surely teaches you much more about this topic.
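
    A rough sketch of the bit-vector idea (the vector size and the choice of MD5 as the word hash are arbitrary assumptions, not anything prescribed above):

    use Digest::MD5 qw(md5);

    my $BITS = 4096;    # pick a size that keeps the ratio of set bits well below 1

    # Sketch: set one bit per distinct word, chosen by a hash of the word.
    sub word_bitvec {
        my ($text) = @_;
        my $vec = "\0" x ( $BITS / 8 );
        for my $word ( grep { length } split /\W+/, lc $text ) {
            vec( $vec, unpack( 'N', md5($word) ) % $BITS, 1 ) = 1;
        }
        return $vec;
    }

    # Approximate distance: bits set in exactly one of the two vectors.
    sub bit_distance {
        my ($va, $vb) = @_;
        return unpack( '%32b*', $va ^ $vb );
    }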

Re: Fingerprinting text documents for approximate comparison
by planetscape (Chancellor) on Mar 27, 2005 at 10:03 UTC

    The techniques of Text Mining sound applicable to your quandary. When I read your question, I thought immediately of an article I'd read:

    Marc Damashek. Gauging Similarity with N-Grams: Language-Independent Categorization of Text. Science, Vol. 267, pp. 843-848, 10 February 1995.

    Similar articles include:

    Searching for text? Send an N-Gram! (includes related articles on implementing n-gram systems and n-gram vectors in C) (technical) Roy E. Kimbrell. Byte, May 1988 v13 n5 p297(9).

    Abstract: N-gram indexing systems are the best method of retrieving information from large full-text databases. An n-gram is a sequence of a specified number of characters occurring in a word. N-gram vectors must be derived for each document stored in order to set up a document-retrieval system using n-grams. N-gram indexing is computationally less intensive than keyword solutions, the next best alternative. N-gram systems are adaptable to several different situations and systems do not need to be re-indexed to answer completely new questions. N-grams are limited in that they are complicated, is memory- and processor-intensive, and is not exact {sic}.

    and:

    "One Size Fits All? A Simple Technique to Perform Several NLP Tasks." by Daniel Gayo-Avello, Darío Álvarez-Gutiérrez, and José Gayo-Avello.

    There are several Perl packages for working with N-Grams; you can search CPAN for them.

    I realize this is only a possible pointer in the right direction, but hope it helps.

    Nancy

    Addendum: This might be the best answer to your question: http://www.perlmonks.org/index.pl?node_id=32285

Re: Fingerprinting text documents for approximate comparison
by BrowserUk (Patriarch) on Mar 24, 2005 at 18:55 UTC

    To do this with any accuracy you need a pre-existing corpus of "typical data".

    The best signature of a given piece of text is the N rarest words it contains, where 'rarest' is defined in terms of the frequency with which each word appears in your corpus of typical data.

    However, there is no easy way to convert that to a single numerical value that will allow your "likelihood" approximation. Even if you took two identical pieces of text that each carried the addition of, say, the transmitting bodies--eg. 'Reuters' and 'CNN'--then those additions would likely affect any reduction to a numerical value in a way that would make comparison very hard.
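
    For what it's worth, extracting such a signature is straightforward once the corpus frequencies exist (a sketch only; %corpus_freq is an assumed, precomputed word => count table and the sub name is made up):

    # Sketch: the N corpus-rarest distinct words of a document, as its signature.
    sub rarest_words {
        my ( $text, $n, $corpus_freq ) = @_;
        my %seen;
        my @words  = grep { length && !$seen{$_}++ } split /\W+/, lc $text;
        my @sorted = sort { ( $corpus_freq->{$a} || 0 ) <=> ( $corpus_freq->{$b} || 0 ) } @words;
        $n = @sorted if $n > @sorted;
        return @sorted[ 0 .. $n - 1 ];
    }
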


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
Re: Fingerprinting text documents for approximate comparison
by johndageek (Hermit) on Mar 24, 2005 at 19:06 UTC
    I would look at creating a fingerprint file for each document (you will need to refine the parameters you use).

    In this file I would put perhaps:
    number of significant words
    average number of letters of the top 5 most common words
    The three least common significant words (alphabetized)
    The three most common significant words (alphabetic)

    You can either use your current checksum, or create a checksum on the fingerprint files.

    Use similar checksums to select fingerprint files to compare; those fingerprints that are within a tolerance you set would be deemed matches.
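
    A sketch of building the fingerprint record described above ("significant" is assumed here to mean longer than five characters, and the document is assumed to contain at least five such words):

    sub fingerprint_record {
        my ($text) = @_;
        my %count;
        $count{$_}++ for grep { length $_ > 5 } split /\W+/, lc $text;

        my @by_freq = sort { $count{$b} <=> $count{$a} } keys %count;
        my $len_sum = 0;
        $len_sum += length $_ for @by_freq[ 0 .. 4 ];

        return {
            significant_words => scalar keys %count,
            avg_top5_length   => $len_sum / 5,
            most_common       => [ sort @by_freq[ 0 .. 2 ] ],
            least_common      => [ sort @by_freq[ -3 .. -1 ] ],
        };
    }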

    Just my 2 cents worth, good luck!

    Enjoy!
    Dageek

Re: Fingerprinting text documents for approximate comparison
by sfink (Deacon) on Mar 24, 2005 at 19:03 UTC
    Keep the small words. If you want to drop stuff, then get a table of the N most common words in your target language (or corpus) and drop those. But you probably needn't bother.

    Pick some fairly short "chunk size". It'll need to be a little longer if you don't drop common words, but you'll probably just need to tune it experimentally.

    Hash every substring of that chunk size. That'll give you one hash per token in your input (approximately -- actually, a few less). If you use a rolling hash function, you can do this fast. For now, ignore the problem that this will be about four times larger than your original input.

    What you do next depends on whether you want to do this pairwise, or you want to find all of the most similar documents in a large corpus.

    Pairwise: hash the two documents and count up the percentage of hashes in common.

    Corpus: Make a giant hash table mapping each of these hashes to some id representing the document they came from. Hash the incoming document and look up each of the hashes in the table. Generate another table mapping document id => count of matching hashes. Sort this by count.

    This is still impractical because of the number of hashes. So randomly drop some proportion of them. Make the drop decision based only upon the value of the hash, so that you drop the same hashes from all the documents. (Instead of comparing entire elephants to elephants, you want to compare elephant legs and trunks to elephant legs and trunks. You do not want to compare legs to sometimes legs, sometimes tails.)
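
    A sketch of the chunk-hashing part (MD5 stands in here for a real rolling hash, and the chunk size and keep fraction are just placeholder values to tune; the sub names are made up):

    use Digest::MD5 qw(md5_hex);

    my $CHUNK = 5;       # tokens per chunk -- tune experimentally
    my $KEEP  = 0.25;    # fraction of hashes to keep, decided by hash value alone

    # Sketch: hash every $CHUNK-token window, then drop hashes purely by value
    # so the same hashes are dropped from every document.
    sub chunk_hashes {
        my ($text) = @_;
        my @tok = grep { length } split /\W+/, lc $text;
        my %kept;
        for my $i ( 0 .. $#tok - $CHUNK + 1 ) {
            my $h = md5_hex( join ' ', @tok[ $i .. $i + $CHUNK - 1 ] );
            $kept{$h} = 1 if hex( substr( $h, 0, 8 ) ) / 0xFFFFFFFF < $KEEP;
        }
        return \%kept;
    }

    # Pairwise: fraction of kept hashes in common.
    sub similarity {
        my ( $ha, $hb ) = @_;
        my $total  = keys %$ha;
        my $common = grep { $hb->{$_} } keys %$ha;
        return $total ? $common / $total : 0;
    }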

    There is a further refinement that eliminates degenerate cases, but it's patented and so you're probably better off not knowing about it. Perhaps you'll rediscover it on your own, but then you won't have this post sitting here to prove willful infringement.

    Search for "MOSS" and "agrep" for more algorithmic details. Useful names are "Alex Aiken" and "Udi Manber".

    Except that your name looks very familiar, so you probably already know all this. I think you may be the guy who asked me to call you a few months back. Sad how I'll happily reply to questions posted on this site, but never get around to replying to my personal email...

Re: Fingerprinting text documents for approximate comparison
by jhourcle (Prior) on Mar 24, 2005 at 23:55 UTC

    I don't have a direct solution, but I would think that similar algorithms to those used for spam fingerprinting might work. I think that Vipul's Razor uses this technique, as do Pyzor and Dcc.

    In looking through the Dcc website, it seems that the keyword you might want to try searching on is 'fuzzy matching'.

Re: Fingerprinting text documents for approximate comparison
by thekestrel (Friar) on Mar 24, 2005 at 21:07 UTC
    Hi,
    How about doing a diff on the two files and then dividing the size of the diff by the size of the file to give you a percentage of similarity? You might want to prune the diff output so it only gives the results from the comparison file; otherwise you're going to get a little over twice the expected size in the diff, as both sides are included plus some fluff from the diff itself.
    This is more relevant than just comparing the sizes of the files (which is obviously not a solution), in that it does give consideration to content.
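
    Something along these lines, say, using Text::Diff (a sketch; the ratio of diff length to original length is only a rough stand-in for "percentage of similarity"):

    use Text::Diff;

    # Sketch: crude dissimilarity estimate from the relative size of the diff.
    sub diff_fraction {
        my ( $text_a, $text_b ) = @_;
        my $diff = diff( \$text_a, \$text_b );
        return length($diff) / ( length($text_a) || 1 );   # 0 means identical
    }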

    Regards Paul
      Urk. Again, any approach that requires comparing two documents directly is going to eat me alive, as I have thousands of these every day. N-squared, y'know.

      To restate my desired outcome: I can checksum each document, and then get a zero-or-one answer by comparing checksums. But what I really want is a "fuzzy" checksum, kind of like taking a thumbnail of an image and comparing the thumbnails. That led me to the approach of throwing out all the short words, whitespace, punctuation, etc. and checksumming the resulting string.

      --
      Jeff Boes
        understood...
        What about compressing the document with something like Huffman encoding, which would shorten all of the words by replacing repeated instances with keys and so really compress the text? You could then compare the 'keys' it uses as replacements, as a way of comparing like text. Going further (though this might be pushing it), you could store just the keys it uses (i.e. the header from the compression), since these are assigned based on frequency of use; that would also let you eliminate all of the short words.
        Just a thought.. =)

        Regards Paul

        Update:
        You would still have to compare the 'signatures', which would still be very time consuming. The only way I can see around this with this method is to take the header that is generated, pick the top 10 most-used chunks (they can be words or phrases), and sort them alphabetically. Then store your document in directories and subdirectories based on each word, i.e.

        \stars\moon\rocks\
        has RocksoftheMoon.doc and GanymedeGeology.doc
        for instance....
        This way you evaluate your document when you get it and store it in a specific place, and like documents just end up in the same directory... or at least nearby.

        Regards Paul
Re: Fingerprinting text documents for approximate comparison
by graff (Chancellor) on Mar 25, 2005 at 10:13 UTC
    Here is a very simple-minded approach that is at least well motivated and is easy to implement and comprehend. Think of each letter as a "vector" that takes you some distance in a particular direction in a 26-dimension space; the more often you encounter a particular letter, the farther you go in that direction. If two documents contain the exact same set of letters, you'll end up at the same place after traversing each one. (If you like, you can keep case distinctions and/or include digits and even punctuation, to add more dimensions.)

    Most people who do this sort of thing do it in terms of some sort of cosine or other trig-like metric, but for a bone-head approach that I'm making up from scratch, that's a bit beyond my ken (I never got that far in math -- or if I did, I've forgotten). So instead, I'll just assign a distinct "distance" (a distinct power of 2) to each letter, so that if two documents contain the same set of letters, I'll traverse the same distance.

    Because it seems sensible, I'll assign distances to the letters in order of their likelihood of occurrence. Then, for each document, I just build up a character histogram, and sum up a distance score by multiplying each character's frequency of occurrence by its assigned magnitude.

    To make things more efficient, I'll prepare the input data by building a lookup table for the docs: a simple list that gives a file name, a doc-ID string, and a starting and ending byte offset in the file. (Whether a given file contains one doc or many is not relevant -- I just need to make sure that each doc has a unique ID.)

    #!/usr/bin/perl
    use strict;

    die "Usage: $0 < doctable > sig.list\n" if ( @ARGV );

    # a simple-minded attempt to generate a "signature" value
    # that could be used as a means of numerically measuring
    # document similarity

    # @letters is ordered according to relative letter frequency in
    # English news text (based on a 1-month sample of AP newswire):
    my @letters = qw/e a t i n o s r h l d c u m p g f w y b v k j x z q/;

    # %vectors stores a distinct power of 2 for each letter;
    # more frequent letters are assigned lower powers:
    my $i;
    my %vectors = map { $_ => 1<<$i++ } @letters;

    my ($openfn,$offset);

    # Input on stdin is one doc entry per line:
    #   file.name  doc_id  begin_offset  end_offset
    while (<>) {
        my ( $fn, $id, $so, $eo ) = split;
        if ( $fn ne $openfn ) {
            open IN, $fn or die "$fn: $!";
            $openfn = $fn;
            $offset = 0;
        }
        seek( IN, $so, 0 ) if ( $so != $offset );
        read( IN, $_, $eo-$so );
        $offset = $eo;
        tr/A-Z/a-z/;
        tr/a-z//cd;
        my %c_hist = ();
        $c_hist{$_}++ for ( split // );
        my $sig = 0;
        $sig += $c_hist{$_} * $vectors{$_} for ( keys %c_hist );
        printf "%s\t%f\n", $id, $sig;
    }

    The output will be such that shorter docs will have smaller signature numbers and longer docs will have larger numbers; docs with equal numbers have the exact same inventory of letters, and ones with small differences in their scores will have very few letters different.

    (Well, there is a possibility that two completely unrelated docs might be anagrams of one another. And of course, a document with 2 "e"s will score the same as one with a single "a", other things being equal, and likewise for other equivalences. But for a first-pass discriminator, it works pretty well. I also tried it with prime numbers instead of powers of 2, and found that the primes yielded more "collisions" -- fewer distinct values on the same set of about 25K documents.)

    Another thing to point out is that this would be easy enough to implement in C to be worthwhile for the run-time you'd save.

    But that's the simple part. The hard part, especially if you're handling docs that are being pulled off of web sites, is to make sure that you only compare the parts that matter.

    I presume you are already taking care to strip out markup, HTML comments, javascript, yadda yadda. But once you reduce the pages down to visible text content, you still need to strip away or simply avoid all the site-specific framework (page headers, footers, side bars, nav. bars, ad nauseam). If you're treating pages from multiple web sites, you need filters adapted to each one (and each filter is likely to have to change over time, as the web sites change their formats at whim). Good luck with that.

Re: Fingerprinting text documents for approximate comparison
by gam3 (Curate) on Mar 25, 2005 at 01:33 UTC
    Here is a very simplistic way you might try.
    #!/usr/bin/perl
    use strict;

    use Digest::MD5 qw(md5 md5_hex md5_base64);

    our $stoplist = {
        and => 0,
        or  => 0,
        but => 0,
    };

    sub main {
        my $line;
        my $words = {};
        while ($line = shift) {
            chomp($line);
            my @data = split(/\b/, $line);
            for my $word (@data) {
                $word =~ s/\s*//g;
                next if length $word < 2;
                $words->{lc($word)}++;
            }
        }
        my @out;
        for my $key (keys %$words) {
            next if $words->{$key} < 2;
            if (defined $stoplist->{$key}) {
                next if $stoplist->{$key} == 0;
                next if $stoplist->{$key} > $words->{$key};
            }
            push @out, $key;
        }
        print join('', sort @out), "\n";
        print md5_base64(join(' ', sort @out)), "\n";
    }

    main(<<EOP);
    The "Digest::MD5" module allows you to use the RSA Data Security Inc. MD5
    Message Digest algorithm from within Perl programs. The algorithm takes as
    input a message of arbitrary length and produces as output a 128-bit
    "fingerprint" or "message digest" of the input.
    EOP

    main(<<EOP);
    The "digest::MD5" Module allows you to use the RSA Data Security Inc. MD5
    Message Digest algorithm from within Perl programs. The algorithm takes as
    input a message of arbitrary length and produces as output a 128-bit
    "fingerprint" and "message digest" of the input.
    EOP
    The output:
    algorithmasdigestinputmd5messageofthe
    PyVzoLxidA4SklaM0RsrhQ
    algorithmasdigestinputmd5messageofthe
    PyVzoLxidA4SklaM0RsrhQ
    
    But looking at spam code is probably the way to go.

    Update: While the idea was sound, the code did not run correctly.

    -- gam3
    A picture is worth a thousand words, but takes 200K.

      That isn't going to be useful. The MD5 algorithm is expressly designed to detect differences, not similarity:

      use Digest::MD5 qw[md5_hex];

      my $s = 'the quick brown fox jumps over the lazy dog';

      print md5_hex $s;
      77add1d5f41223d5582fca736a5cb335

      print md5_hex $s . 's';
      5e48a737eaff799917707b2815af10fc

      print md5_hex $s . 'S';
      d02763729a741eed14417a1051ec228c

      Even the addition of a single character, or changing a single bit produces a (numerically) completely unrelated digest--exactly as it should for the purposes for which md5 is designed, but completely wrong for this application.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?
        The MD5 is only turning a list of words into a number. It is the list of words that is the fingerprint of the file. You could just compare the words. The MD5 is just being used as a checksum.
        -- gam3
        A picture is worth a thousand words, but takes 200K.
