PerlMonks
I'm hoping I can feed a module 1000 lines of Slinker and 1000 lines of Stinker and have it say something like "These two files have a Herzenberger-Foogenboogen Written English Similarity Rating of 97%".
Two academics came up with a clever use of zip-based compression for this type of analysis. Their scheme, which they first developed for automatic language detection but which is also useful for determining authorship, is glossed over here. Basically, they noted that if you have a chunk of text from an unknown author who is known to belong to a particular set, and you have sample texts from each author in that set, you can concatenate the unknown text with the sample from each author and look for the concatenation that compresses best. The best-compressing pairing points to the most likely author, since a compressor gains the most when the appended text repeats patterns it has already seen in the sample. It's a clever approach, and easily implemented in Perl.
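For anyone who wants to see the shape of the idea before writing the Perl, here is a minimal sketch in Python using the standard `zlib` module. It is my reading of the scheme, not the academics' actual code: the helper names (`compressed_size`, `added_cost`, `best_match`) are made up for illustration, and I am assuming that "compresses best" is measured as the smallest growth in compressed size when the unknown text is appended to a sample, so that samples of different lengths can be compared fairly.

```python
import zlib


def compressed_size(data):
    """Number of bytes after zlib (DEFLATE) compression at the highest level."""
    return len(zlib.compress(data, 9))


def added_cost(sample, unknown):
    """Extra compressed bytes needed once `unknown` is appended to `sample`.

    Assumption: a small increase means the unknown text reuses patterns
    that already occur in the sample.
    """
    return compressed_size(sample + unknown) - compressed_size(sample)


def best_match(unknown, samples):
    """Return the key in `samples` whose text absorbs `unknown` most cheaply.

    `samples` maps a hypothetical author label to that author's sample text.
    """
    unknown_bytes = unknown.encode("utf-8")
    costs = {
        name: added_cost(text.encode("utf-8"), unknown_bytes)
        for name, text in samples.items()
    }
    return min(costs, key=costs.get)


if __name__ == "__main__":
    # Toy data only; real use would read the sample files from disk.
    samples = {
        "Slinker": "the quick brown fox jumps over the lazy dog " * 20,
        "Stinker": "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20,
    }
    print(best_match("the quick brown fox jumps over the lazy dog ", samples))
```

One caveat with this sketch: DEFLATE only looks back 32 KB, so with long samples only the tail of each sample can influence the result. Keeping the unknown chunk short and the samples modest in size works around that.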
In reply to Re: Text Analysis Tools to compare Slinker and Stinker?
by dws