PerlMonks  

Re: Re: Text Analysis Tools to compare Slinker and Stinker?

by Cody Pendant (Prior)
on Jan 22, 2003 at 04:42 UTC ( [id://228941] )


in reply to Re: Text Analysis Tools to compare Slinker and Stinker?
in thread Text Analysis Tools to compare Slinker and Stinker?

It's not that I don't appreciate the effort, but I'm going to have to ask people to stop trying to help me with the social and administrative aspects of my problem, really.

I won't explain the rules of the community involved, that would be silly. But if we were convinced that the two people were the same, action would be taken, that's all you need to know.

If a text-analysis tool showed that the two had very similar writing styles, at a level where the odds against coincidence were 1000 to one, then that would be considered proof.

But, having used the Fathom module, see above, I've got nothing conclusive, I'm afraid. It's a very useful tool but hasn't proven or disproven anything. There are fewer differences between two randomly-chosen posters than between Slinker and Stinker, it turns out.
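For readers who want to see what kind of numbers are being compared here, the following is a minimal sketch in Python of two such statistics: average sentence length and an approximate Gunning Fog index. It is not Fathom's API or its algorithm. The function names `count_syllables` and `style_stats` are made up for illustration, and the syllable count is a crude vowel-run heuristic.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count runs of vowels (a heuristic, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def style_stats(text):
    """Return sentence count, word count, average sentence length and an approximate Fog index."""
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return {"sentences": 0, "words": 0, "avg_sentence_len": 0.0, "fog": 0.0}
    # "Complex" words are taken to be those with three or more estimated syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    avg_len = len(words) / len(sentences)
    fog = 0.4 * (avg_len + 100.0 * len(complex_words) / len(words))
    return {
        "sentences": len(sentences),
        "words": len(words),
        "avg_sentence_len": avg_len,
        "fog": fog,
    }
```

Running `style_stats` on each poster's collected text gives one small profile per poster, and those profiles are what end up being compared.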

Another angle of attack on this problem, which I hadn't thought of before, is mis-spellings -- Slinker has spelt "happening" as "happenning" twice, but Stinker gets it right every time...
--
“Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D

Re: Re: Re: Text Analysis Tools to compare Slinker and Stinker?
by BigLug (Chaplain) on Jan 22, 2003 at 05:15 UTC
    This is a great idea for a problem like yours. Combining readability, tuples, Fathom, etc. with misspellings (or is it mispellings or missspellings or ...), I wonder how successful a module for comparing two texts could be. I might take a look at that sometime in the next few weeks. I really think that misspellings might be a great key to comparing two texts. Judging from the above information, I'd have to guess that stinker != slinker. It would be unusually difficult to fix one's spelling just to get back into a web community. (IMHO)
      This came up a few years ago in another forum I've been a part of. It was about the general issue of recognising anonymous text, not a particular incident in the forum, and was prompted by something like that happening in the news.

      I found I was able to write in a manner which neither person nor software was able to correctly match up with my reference material. About 20% of the people who tried had the same results. Others were matched and were often surprised by what tripped them up when we posted our guesses.

      Some people used the very tools under discussion to pre-check their work before posting the anonymous sample. Naturally, the computer's guess then showed no match. I furthermore used writing constructs that are among my pet peeves, and a simpler vocabulary (as measured by a reading-level tool), and tripped up the human guessers as well. I think keeping the reading "level" down helped against the automatic scans too, since simpler text has more in common with all text.

      BTW, nearly everyone who tried was a successful (published, that is) writer.

      —John

      'Misspellings' are precisely where Bayesian filtering, once trained, will help tremendously (though as others have pointed out, never conclusively).

      As an example from the anti-spam efforts, once Bayesian filtering was enabled they were amazed that the single token with the highest probability of indicating spam was 'FF0000', the hex value for bright red. Unexpected, but damning.

      Consistently misspelt words could show up accordingly.
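To make the Bayesian idea concrete, here is a small sketch in Python of per-token scoring between two authors. It is not any particular spam filter's algorithm. The names `train`, `log_odds` and `most_telling` are invented for this example, and the add-one smoothing is an assumption chosen for simplicity.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lower-case word-like tokens; digits are kept so things like 'ff0000' survive."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(samples):
    """Count token frequencies over a list of texts by one author."""
    counts = Counter()
    for text in samples:
        counts.update(tokens(text))
    return counts

def log_odds(token, counts_a, counts_b, vocab_size):
    """Smoothed log-odds that a token points to author A rather than author B."""
    p_a = (counts_a[token] + 1) / (sum(counts_a.values()) + vocab_size)
    p_b = (counts_b[token] + 1) / (sum(counts_b.values()) + vocab_size)
    return math.log(p_a / p_b)

def most_telling(counts_a, counts_b, n=5):
    """The n tokens whose log-odds are furthest from zero, as (score, token) pairs."""
    vocab = set(counts_a) | set(counts_b)
    scored = [(log_odds(t, counts_a, counts_b, len(vocab)), t) for t in vocab]
    return sorted(scored, key=lambda pair: abs(pair[0]), reverse=True)[:n]
```

A token that one author uses often and the other never uses, such as a habitual misspelling, floats to the top of `most_telling`, in the same way 'FF0000' did for spam.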

      Mattt

        Hmm, I don't know about the rest of you, but when I'm typing a lot, I generally don't make the same mistakes. OK, when someone consistently misspells the same word, it's probably because they don't know how to spell it, but what I'm getting at is that, more often than not, I just hit the keys in the wrong order.
        When there's a lot going on in a forum (or MUD, whatever), one tends to get out as much as possible in order to keep up, which produces a lot of inconsistent misspellings. (It does in my case anyway, I'd be glad to prove it.)
        Which makes me think that comparing misspellings is not such a good way to do it.

        Example: German word 'erzaehl', comes out as 'erzahle', 'erzeahl', 'erzaelh' etc. (I've even made aliases for most of these ;)

        C.

      The only part of the process that I'm not confident about is the control.

      Say I compare Slinker and Stinker, and they have almost exactly the same average sentence length, FOG readability index and so on. How do I know I wouldn't get the same result comparing Slinker with Ernest Hemingway or Toni Morrison or Irvine Welsh?

      You'd need to be able to say with confidence that if author A scores a 97% similarity score with author B, then you couldn't get the same result with author X.
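One way to frame that control is to score every pair of authors in a reference pool and see how many of them come out at least as close as the suspect pair. The sketch below, in Python, assumes each author has already been reduced to a dictionary of numeric style features. The names `distance` and `control_fraction`, and the feature values in the usage example, are made up for illustration and are not measurements from this thread.

```python
from itertools import combinations

def distance(features_a, features_b):
    """Sum of absolute differences over the style features both authors share."""
    shared = features_a.keys() & features_b.keys()
    return sum(abs(features_a[k] - features_b[k]) for k in shared)

def control_fraction(profiles, suspect_a, suspect_b):
    """Fraction of control pairs that are at least as close as the suspect pair."""
    target = distance(profiles[suspect_a], profiles[suspect_b])
    controls = [
        distance(profiles[x], profiles[y])
        for x, y in combinations(sorted(profiles), 2)
        if {x, y} != {suspect_a, suspect_b}
    ]
    if not controls:
        raise ValueError("need at least one control pair")
    return sum(1 for d in controls if d <= target) / len(controls)

# Hypothetical numbers, for illustration only.
profiles = {
    "Slinker": {"avg_len": 14.0, "fog": 9.0},
    "Stinker": {"avg_len": 15.0, "fog": 9.5},
    "X": {"avg_len": 14.5, "fog": 9.0},
    "Y": {"avg_len": 25.0, "fog": 14.0},
}
print(control_fraction(profiles, "Slinker", "Stinker"))
```

If a large fraction of unrelated pairs are as close as the suspects, the similarity says little. Only when almost no control pair comes that close does the match start to mean something.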

      The mis-spellings ought to be quite easy to implement though:

      Author A makes the following mistakes every time.
      Author B makes the following mistakes every time.

      That would convince me...
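A rough sketch of that check in Python follows. It assumes a hand-made watch list mapping each misspelling to the intended word. Only "happenning" comes from this thread; the other entry, the sample sentences in the usage note, and the names `WATCH`, `spelling_habits` and `always_misspells` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical watch list: misspelling -> intended word.
WATCH = {"happenning": "happening", "recieve": "receive"}

def spelling_habits(text, watch=WATCH):
    """For each watched word, return (correct uses, misspelt uses) found in text."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    habits = {}
    for wrong, right in watch.items():
        # Start from the count of the correct form, then add each wrong variant.
        good, bad = habits.get(right, (counts[right], 0))
        habits[right] = (good, bad + counts[wrong])
    return habits

def always_misspells(habits):
    """Words the author got wrong at least once and never got right."""
    return {word for word, (good, bad) in habits.items() if bad > 0 and good == 0}
```

For example, `always_misspells(spelling_habits("What is happenning? It keeps happenning."))` gives `{"happening"}`, while the same call on text that spells the word correctly gives an empty set. Requiring that the correct form never appears is one way to separate a habit from the one-off typing slips mentioned above.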
      --
      “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D

Re: Re: Re: Text Analysis Tools to compare Slinker and Stinker?
by pg (Canon) on Jan 22, 2003 at 05:18 UTC
    I am not fighting against you (I am 100% sincere), but do you realize that you are not actually trying to find a "good" tool? You are trying to find a tool that "conclusively" confirms your guess, and that convinces your community members and yourself to "believe" something you have already decided.

    In this situation, no tool that goes against your guess would count as a good tool.

    I am just telling the truth, although it might be difficult to ... ;-)
      OK, as you won't give up, pg, here are the rules in question:
      1. Bad behaviour gets you a first warning
      2. If you don't improve after a second warning, you get a two-month suspension
      3. If you attempt to rejoin the community under another name while suspended, no matter how well you behave, you get banned

      I really think these are fair rules. And they're stated upfront.

      But no matter what rules we choose, the facts are these:

      • We suspect someone of lying about who they are.
      • When you suspect someone is lying, asking them "hey, are you lying?" is not a logical way to find out.
      • Linguistic analysis is. And there are great Perl modules for it.

      You should be happy with the outcome anyway pg, because as far as I'm concerned, with the help of Perl, I'm now satisfied that these two people aren't the same. It's like one of those annoying lawyer shows where they prove the guy innocent.
      --
      “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D

Re: Re: Re: Text Analysis Tools to compare Slinker and Stinker?
by cadfael (Friar) on Jan 23, 2003 at 03:05 UTC
    But, having used the Fathom module, see above, I've got nothing conclusive, I'm afraid. It's a very useful tool but hasn't proven or disproven anything. There are fewer differences between two randomly-chosen posters than between Slinker and Stinker, it turns out.

    Another angle of attack on this problem, which I hadn't thought of before, is mis-spellings -- Slinker has spelt "happening" as "happenning" twice, but Stinker gets it right every time...

    Leaving alone the issue of whether it is really worth it to spend a lot of time on this mystery, testing services have dealt with some aspects of your problem. Especially the personality tests where they ask you the same question in many slightly different ways and perform some kind of analysis to determine whether you are trying to spoof the test by appearing to be someone you are not.

    Your mention of a spelling discrepancy brought to mind a scene from The Princess Bride where Westley was to add poison to one of the drinks, and his adversary was to choose, after Westley had shifted (or not) the position of the glasses. The bad guy goes through a series of questions and answers trying to figure out Westley's thoughts -- "You placed the poisoned glass closer to me so I'd choose it. But I'm too smart for that, so it must be the one closest to you... But you knew I'd anticipate that move, so it must be the one closest to me after all." And so on for a few minutes of pretty funny dialogue. (I'm sure I got the details turned around, but you get the gist.)

    Is this guy deliberately misspelling a word or two just to throw you off? Does it really matter? It still boils down to a guess, doesn't it?

    Even after centuries of linguistic analysis, and lately with some fairly sophisticated computer analysis, scholars are still arguing whether Marlowe wrote the works attributed to Shakespeare, or whether Shakespeare was, indeed, Shakespeare.

    -----
    "Computeri non cogitant, ergo non sunt"

      >a scene from The Princess Bride
      Man in black:  (turning his back, and adding the poison to one of the goblets)
      	Alright, where is the poison?  The battle of wits has begun.  It ends
      	when you decide and we both drink - and find out who is right, and who
      	is dead.
      Vizzini:  But it's so simple.  All I have to do is divine it from what I know of
      	you.  Are you the sort of man who would put the poison into his own
      	goblet or his enemy's? Now, a clever man would put the poison into his
      	own goblet because he would know that only a great fool would reach for
      	what he was given.  I am not a great fool so I can clearly not choose
      	the wine in front of you...But you must have known I was not a great
      	fool; you would have counted on it, so I can clearly not choose the wine
      	in front of me.
      Man in black:  You've made your decision then?
      Vizzini:  (happily) Not remotely!  Because Iocaine comes from Australia.  As
      	everyone knows, Australia is entirely peopled with criminals.  And
      	criminals are used to having people not  trust them, as you are not
      	trusted by me.	So, I can clearly not choose the wine in front of you.
      Man in black:  Truly, you have a dizzying intellect.
      Vizzini:  Wait 'till I get going!!  ...where was I?
      Man in black:  Australia.
      Vizzini:  Yes! Australia!  And you must have suspected I would have known the
      	powder's origin, so I can clearly not choose the wine in front of me.
      Man in black:  You're just stalling now.
      Vizzini:  You'd like to think that, wouldn't you!  You've beaten my giant, which
      	means you're exceptionally strong...so you could have put the poison in
      	your own goblet trusting on your strength to save you, so I can clearly
      	not choose the wine in front of you.  But, you've also bested my
      	Spaniard, which means you must have studied...and in studying you must
      	have learned that man is mortal so you would have put the poison as far
      	from yourself as possible, so I can clearly not choose the wine in front
      	of me!
      
      Man in black:  You're trying to trick me into giving away something.  It won't
      	work.
      Vizzini:  It has worked!  You've given everything away! I know where the poison
      	is!
      Man in black:  Then make your choice.
      Vizzini:  I will, and I choose...(pointing behind the man in black) What in the
      	world can that be?
      Man in black:  (turning around, while Vizzini switches goblets) What?! Where?! I
      	don't see anything.
      Vizzini:  Oh, well, I...I could have sworn I saw something. No matter.	(Vizzini
      	laughs)
      Man in black:  What's so funny?
      Vizzini:  I...I'll tell you in a minute.  First, let's drink, me from my glass
      	and you from yours.
      
      (They both drink)
      


      Is that the one you meant?

      I still maintain that this was an interesting exercise.

      I did one other thing, which was brute-force but also interesting.

      I grabbed every 2-char string from the posters, put them in a hash with number of occurrences, sorted the results by number, and compared the most popular 1,000 2-char strings from the suspect posters with the most popular 2-char strings from "real" posters. Again, the results were inconclusive.

      Slinker and Stinker shared 75% of the most-popular-strings, but another poster shared 68%, so it wasn't very dramatic evidence either way.
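The original Perl for this experiment is not shown, so the following is only a sketch of the same idea in Python: count every 2-character string, keep the most frequent ones, and measure how many two texts have in common. The names `top_bigrams` and `shared_fraction` are invented here. The default of 1,000 follows the description above.

```python
from collections import Counter

def top_bigrams(text, n=1000):
    """The n most frequent 2-character strings in text, returned as a set."""
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {gram for gram, _ in counts.most_common(n)}

def shared_fraction(text_a, text_b, n=1000):
    """Share of text_a's top-n 2-character strings that are also in text_b's top n."""
    top_a = top_bigrams(text_a, n)
    top_b = top_bigrams(text_b, n)
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)
```

Running `shared_fraction` for the suspect pair and again for each suspect against ordinary posters gives figures of the kind quoted above, and the gap between them is what matters, not either number alone.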
      --
      “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D
