Re: Re: Re: Text Analysis Tools to compare Slinker and Stinker?

by BigLug (Chaplain)
on Jan 22, 2003 at 05:15 UTC


in reply to Re: Re: Text Analysis Tools to compare Slinker and Stinker?
in thread Text Analysis Tools to compare Slinker and Stinker?

This is a great approach to a problem like yours. Combining readability, tuples, Fathom etc. with misspellings (or is it mispellings, or missspellings, or ...), I wonder how successful a module for comparing two texts could be. I might take a look at that sometime in the next few weeks. I really do think misspellings could be a great key for comparing two texts. Judging from the information above, I'd have to guess that stinker != slinker. It would be unusually difficult to fix your spelling habits just to get back into a web community. (IMHO)
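
Just to make the idea concrete, here's a rough sketch of the readability half, using Lingua::EN::Fathom from CPAN (the file names and the crude sum-of-differences "distance" below are placeholders for illustration, not a real metric):

    use strict;
    use warnings;
    use Lingua::EN::Fathom;

    # Build a readability fingerprint for each corpus.
    my %stats;
    for my $file ('slinker.txt', 'stinker.txt') {
        my $fathom = Lingua::EN::Fathom->new();
        $fathom->analyse_file($file);
        $stats{$file} = {
            fog     => $fathom->fog,
            flesch  => $fathom->flesch,
            kincaid => $fathom->kincaid,
            wps     => $fathom->words_per_sentence,
        };
    }

    # Crude comparison: sum of absolute differences over the metrics.
    my $diff = 0;
    for my $metric (keys %{ $stats{'slinker.txt'} }) {
        $diff += abs($stats{'slinker.txt'}{$metric}
                   - $stats{'stinker.txt'}{$metric});
    }
    printf "Metric distance: %.2f (smaller means more alike)\n", $diff;

A per-author misspelling list could feed the same kind of fingerprint.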

Re: Re: Re: Re: Text Analysis Tools to compare Slinker and Stinker?
by John M. Dlugosz (Monsignor) on Jan 22, 2003 at 07:13 UTC
    This came up a few years ago in another forum I've been a part of, on the general issue of recognising anonymous text rather than over any particular incident in that forum, after something like that happened in the news.

    I found I was able to write in a manner that neither people nor software could correctly match to my reference material. About 20% of those who tried had the same result. Others were matched, and were often surprised by what tripped them up when we posted our guesses.

    Some people used the very tools under discussion to pre-check their work before posting the anonymous sample. Naturally, their samples came up as non-matches in the computer's guesses. I furthermore used writing constructs that are among my pet peeves, and a simpler vocabulary (as measured by a reading-level tool), and tripped up the human guessers as well. I think keeping the reading "level" down helped against the automatic scans too, since simpler text has more in common with all text.

    BTW, almost everyone who tried was a successful (published, that is) writer.

    —John

Re^4: Text Analysis Tools to compare Slinker and Stinker?
by mojotoad (Monsignor) on Jan 22, 2003 at 06:57 UTC
    'Misspellings' are precisely where Bayesian filtering, once trained, will help tremendously (though as others have pointed out, never conclusively).

    As an example from the anti-spam efforts: once Bayesian filtering was enabled, people were amazed that the single token with the highest probability of indicating spam was 'FF0000', the hex value for bright red. Unexpected, but damning.

    Consistently misspellt words could show up accordingly.
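
    Just to illustrate (toy counts, nothing actually trained): the per-token probability that made 'FF0000' stand out is simple to sketch, and a habitual misspelling would behave the same way:

        use strict;
        use warnings;

        # Toy token counts per author -- the numbers are invented.
        my %seen_a = ( teh => 12, recieve => 5 );
        my %seen_b = ( teh => 1,  recieve => 0 );
        my ($total_a, $total_b) = (10_000, 10_000);   # corpus sizes

        for my $token (sort keys %seen_a) {
            # Smoothed per-corpus frequencies, then P(author A | token).
            my $freq_a = (($seen_a{$token} || 0) + 0.01) / $total_a;
            my $freq_b = (($seen_b{$token} || 0) + 0.01) / $total_b;
            my $p = $freq_a / ($freq_a + $freq_b);
            printf "%-10s %.2f\n", $token, $p;
        }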

    Mattt

      Hmm, I don't know about the rest of you, but when I'm typing a lot I generally don't make the same mistakes. OK, when someone consistently misspells the same word, it's probably because they don't know how to spell it; but what I'm getting at is that, more often than not, I just hit the keys in the wrong order.
      When there's a lot going on in a forum (or MUD, whatever), one tends to get out as much as possible in order to keep up, which produces a lot of inconsistent misspellings. (It does in my case, anyway; I'd be glad to prove it.)
      Which makes me think that comparing misspellings is not such a good way to do it.

      Example: the German word 'erzaehl' ('tell') comes out as 'erzahle', 'erzeahl', 'erzaelh', etc. (I've even made aliases for most of these ;)
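
      A quick sketch of why: one intended word already has half a dozen one-off neighbours from adjacent-character slips alone, so no single variant need ever repeat:

          use strict;
          use warnings;

          # Generate every adjacent-character swap of one intended word.
          my $word = 'erzaehl';
          my %typos;
          for my $i (0 .. length($word) - 2) {
              my $t = $word;
              substr($t, $i, 2) = reverse substr($t, $i, 2);
              $typos{$t}++;
          }
          print join(', ', sort keys %typos), "\n";
          # Six variants, including the 'erzeahl' and 'erzaelh' above.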

      C.

Re: Re: Re: Re: Text Analysis Tools to compare Slinker and Stinker?
by Cody Pendant (Prior) on Jan 22, 2003 at 06:25 UTC
    The only part of the process that I'm not confident about is the control.

    Say I compare Slinker and Stinker, and they have almost exactly the same average sentence length, FOG readability index and so on; how do I know I wouldn't get the same result comparing Slinker with Ernest Hemingway or Toni Morrison or Irvine Welsh?

    You'd need to be able to say with confidence that if author A scores 97% similarity with author B, you couldn't get the same result with some unrelated author X.

    A misspellings check ought to be quite easy to implement, though:

    Author A makes the following mistakes every time.
    Author B makes the following mistakes every time.

    That would convince me...
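
    Something like this is all it would take, once a spellchecker pass has pulled each author's habitual mistakes out of their corpus (the word lists below are invented):

        use strict;
        use warnings;

        # Hypothetical habitual-misspelling sets, one per author; in
        # practice you'd harvest them with a spellchecker over each corpus.
        my %author_a = map { $_ => 1 } qw(recieve seperate definately);
        my %author_b = map { $_ => 1 } qw(recieve seperate accomodate);

        my @shared = grep { $author_b{$_} } keys %author_a;
        my %union  = (%author_a, %author_b);
        my $overlap = @shared / scalar(keys %union);   # Jaccard index

        printf "Shared mistakes: %s (overlap %.2f)\n",
            join(', ', sort @shared), $overlap;

    Run the same comparison against a few control authors and you'd have the confidence check from above.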
    --
    “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D
