
I'm hoping I can feed a module 1000 lines of Slinker and 1000 lines of Stinker and have it say something like "These two files have a Herzenberger-Foogenboogen Written English Similarity Rating of 97%".

Two academics came up with a clever use of zip-based compression for this type of analysis. Their scheme, first developed for automatic language detection but also useful for determining authorship, is summarized here.

Basically, they noted that if you have a chunk of text from an unknown author who is known to belong to some set of authors, and you have sample texts from each author in that set, you can concatenate the unknown text with each author's sample and look for the concatenation that compresses best. The sample whose patterns the compressor can reuse most, so that the unknown text adds the fewest bytes, points to the most likely author.

It's a clever approach, and easily implemented in Perl.
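
To make the concatenate-and-compress idea concrete, here is a minimal sketch. It is written in Python with the standard `zlib` module rather than Perl, and it is only an illustration of the idea as described above, not the academics' actual code. The function names (`compressed_size`, `extra_bytes`, `best_match`) and the sample strings are made up for the example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(sample: bytes, unknown: bytes) -> int:
    """How many compressed bytes the unknown text adds on top of the sample.

    A small number means the compressor found much of the unknown text
    already present in the sample.
    """
    return compressed_size(sample + unknown) - compressed_size(sample)


def best_match(samples: dict, unknown: bytes) -> str:
    """Name of the sample after which the unknown text compresses best."""
    return min(samples, key=lambda name: extra_bytes(samples[name], unknown))


if __name__ == "__main__":
    # Hypothetical stand-ins for real author samples.
    samples = {
        "slinker": b"the quick brown fox jumps over the lazy dog " * 20,
        "stinker": b"lorem ipsum dolor sit amet consectetur " * 20,
    }
    unknown = b"the quick brown fox jumps over the lazy dog "
    for name, text in samples.items():
        print(name, extra_bytes(text, unknown))
    print("best match:", best_match(samples, unknown))
```

One caveat with this sketch: zlib's DEFLATE only looks back 32 KB, so with long samples only the tail of each sample influences how the unknown text compresses. The same steps should map onto Perl's Compress::Zlib, whose `compress` function returns the compressed string so its length can be compared the same way.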

In reply to Re: Text Analysis Tools to compare Slinker and Stinker? by dws
in thread Text Analysis Tools to compare Slinker and Stinker? by Cody Pendant

	PerlMonks