PerlMonks  

Re: Text Analysis Tools to compare Slinker and Stinker?

by pg (Canon)
on Jan 22, 2003 at 04:15 UTC


in reply to Text Analysis Tools to compare Slinker and Stinker?

To be frank, no linguistic analysis solution would be MEANINGFUL and HELPFUL in this case, no matter how good that solution is.

Think about this at a higher level, and don't sink into technical details too quickly. This is actually a good example of a case where TECHNOLOGY does not help with SOCIAL issues.

Think about this: however perfect the analysis tool is, it would require a large amount of input to yield any MEANINGFUL result. The reality is that if Slinker behaves in the same way as Stinker, whether or not they are one person, you will most likely have banned Slinker long before your tool gives you any MEANINGFUL result to JUSTIFY your decision.

On the other hand, if Slinker behaves better, then even if your nice linguistic analysis tool figures out that Slinker is Stinker, there is still no JUSTIFIED reason for you to ban him. In this case, the only thing a technically capable tool does is create negative social feeling.

Summary:

If the analysis tool says S == S:
  • Slinker behaves badly: the analysis takes a large amount of data, and most likely your emotions would help you reach a decision much more quickly.
  • Slinker behaves well (I cannot use the word "better" here, as that is logically wrong unless Stinker is in fact Slinker): yes, he is the same person, but you have no reason to ban him, and the revelation only affects everyone's feelings in a negative way.
If the analysis tool says S != S:
  • Slinker behaves badly: Stinker would still be banned whether or not the result from your tool is correct; your bad feeling would take care of this.
  • Slinker behaves well: too obvious; the analysis is a total waste of time.

Re: Re: Text Analysis Tools to compare Slinker and Stinker?
by Cody Pendant (Prior) on Jan 22, 2003 at 04:42 UTC
    It's not that I don't appreciate the effort, but I'm going to have to ask people to stop trying to help me with the social and administrative aspects of my problem, really.

    I won't explain the rules of the community involved, that would be silly. But if we were convinced that the two people were the same, action would be taken, that's all you need to know.

    If a text-analysis tool showed that the two had very similar writing styles, at a level where the odds were 1000-to-one against the similarity being coincidental, then that would be considered proof.

    But, having used the Fathom module, see above, I've got nothing conclusive, I'm afraid. It's a very useful tool but hasn't proven or disproven anything. There are fewer differences between two randomly-chosen posters than between Slinker and Stinker, it turns out.

    Another angle of attack on this problem, which I hadn't thought of before, is mis-spellings -- Slinker has spelt "happening" as "happenning" twice, but Stinker gets it right every time...
    --
    “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D

      This is a great idea for a problem like yours. Combining readability, tuples, Fathom, etc. with misspellings (or is it mispellings or missspellings or ...), I wonder how successful a module for comparing two texts could be. I might take a look at that sometime in the next few weeks. I really think that misspellings might be a great key to comparing two texts. Judging from the above information, I'd have to guess that Stinker != Slinker. It would be unusually difficult to fix your spelling just to get back into a web community. (IMHO)
        This came up a few years ago in another forum I've been a part of, as a general question about recognising anonymous text rather than a particular issue in the forum, after something like that happened in the news.

        I found I was able to write in a manner which neither person nor software was able to correctly match up with my reference material. About 20% of the people who tried had the same results. Others were matched and were often surprised by what tripped them up when we posted our guesses.

        Some people used the very tools under discussion to pre-check their work before posting the anonymous sample. Naturally, the computer's guess then showed no match. I furthermore used writing constructs that are among my pet peeves, and a simpler vocabulary (as measured by a reading-level tool), and tripped up the human guessers as well. I think keeping the reading "level" down helped against the automatic scans too, since simpler text has more in common with all text.

        BTW, almost everyone who tried was a successful (published, that is) writer.

        —John

        'Misspellings' are precisely where Bayesian filtering, once trained, will help tremendously (though as others have pointed out, never conclusively).

        As an example from the anti-spam efforts: once Bayesian filtering was enabled, they were amazed that the single token with the highest probability of indicating spam was 'FF0000', the hex value for bright red. Unexpected, but damning.

        Consistently misspelled words could show up accordingly.
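To make the token idea concrete, here is a minimal sketch, written in Python rather than the Perl used elsewhere in this thread. It is not a full Bayesian filter: it only computes smoothed per-token log-odds between two authors, so that tokens one author uses and the other does not float to the top. The function names and the add-one smoothing are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def token_counts(text):
    """Lower-case the text and count word-like tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def token_log_odds(texts_a, texts_b, smoothing=1.0):
    """Return {token: log-odds}; positive values favour author A.

    Additive smoothing (an assumption of this sketch) keeps tokens seen
    by only one author from producing infinite scores.
    """
    counts_a = token_counts(" ".join(texts_a))
    counts_b = token_counts(" ".join(texts_b))
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    return {
        tok: math.log((counts_a[tok] + smoothing) / total_a)
        - math.log((counts_b[tok] + smoothing) / total_b)
        for tok in vocab
    }

def most_telling(log_odds, n=5):
    """Tokens with the largest absolute log-odds, strongest first."""
    return sorted(log_odds, key=lambda t: (-abs(log_odds[t]), t))[:n]
```

With enough text per author, a consistently misspelled word would be expected to appear near the top of `most_telling`, much as 'FF0000' did for spam. As noted above, that is evidence, never proof.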

        Mattt

        The only part of the process that I'm not confident about is the control.

        Say I compare Slinker and Stinker, and they have almost exactly the same average sentence length, FOG readability index and so on. How do I know I wouldn't get the same result comparing Slinker with Ernest Hemingway or Toni Morrison or Irvine Welsh?

        You'd need to be able to say with confidence that if author A scores a 97% similarity score with author B, then you couldn't get the same result with author X.
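One way to approach the control question is to score the suspect pair and then score every other pair of authors with the same measure, to see how often unrelated pairs look just as similar. The sketch below is in Python rather than Perl. Its toy similarity score uses average sentence length only, where a real check would combine several measures. All function names and the scoring formula are assumptions for illustration.

```python
import re
from itertools import combinations

def avg_sentence_length(text):
    """Mean number of words per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

def similarity(text_a, text_b):
    """Toy score in (0, 1]; 1.0 means identical average sentence length."""
    diff = abs(avg_sentence_length(text_a) - avg_sentence_length(text_b))
    return 1.0 / (1.0 + diff)

def control_check(texts, suspect_a, suspect_b):
    """Score the suspect pair and count control pairs that match or beat it.

    `texts` maps author name -> text. Control pairs are all pairs of
    authors other than the suspect pair itself.
    Returns (suspect_score, pairs_at_least_as_similar, control_pair_count).
    """
    suspect_score = similarity(texts[suspect_a], texts[suspect_b])
    control_scores = [
        similarity(texts[a], texts[b])
        for a, b in combinations(sorted(texts), 2)
        if {a, b} != {suspect_a, suspect_b}
    ]
    beaten_by = sum(score >= suspect_score for score in control_scores)
    return suspect_score, beaten_by, len(control_scores)
```

If many control pairs score as high as the suspect pair, the measure cannot tell authors apart and a high score proves nothing.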

        The mis-spellings ought to be quite easy to implement though:

        Author A makes the following mistakes every time.
        Author B makes the following mistakes every time.

        That would convince me...
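A minimal sketch of that comparison, in Python rather than Perl. It assumes a hand-built watch list mapping each misspelling to the intended word. The list contents and every name below are hypothetical, and a real list would have to come from a spell checker or from reading the posts.

```python
import re
from collections import Counter

# Hypothetical watch list: misspelling -> intended word.
WATCH_LIST = {
    "happenning": "happening",
    "recieve": "receive",
    "definately": "definitely",
}

def spelling_profile(text, watch_list=WATCH_LIST):
    """Count, per intended word, how often it was spelled right and wrong."""
    profile = {word: Counter() for word in set(watch_list.values())}
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in watch_list:
            profile[watch_list[token]]["wrong"] += 1
        elif token in profile:
            profile[token]["right"] += 1
    return profile

def always_wrong(profile):
    """Words the author used at least once and misspelled every time."""
    return {w for w, c in profile.items() if c["wrong"] > 0 and c["right"] == 0}

def always_right(profile):
    """Words the author used at least once and never misspelled."""
    return {w for w, c in profile.items() if c["right"] > 0 and c["wrong"] == 0}

def compare_authors(text_a, text_b, watch_list=WATCH_LIST):
    """Report mistakes both authors always make, and words where they disagree."""
    prof_a = spelling_profile(text_a, watch_list)
    prof_b = spelling_profile(text_b, watch_list)
    wrong_a, wrong_b = always_wrong(prof_a), always_wrong(prof_b)
    right_a, right_b = always_right(prof_a), always_right(prof_b)
    return {
        "shared_errors": sorted(wrong_a & wrong_b),
        "conflicts": sorted((wrong_a & right_b) | (wrong_b & right_a)),
    }
```

Shared errors point towards one author and conflicts point away, but a handful of occurrences is weak evidence either way.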
        --
        “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D

      I am not fighting against you (I am 100% sincere), but did you realize that you are not actually trying to find a "good" tool? You are trying to find a tool that "conclusively" confirms your guess, and that convinces your community members and yourself to "believe" something you have already decided.

      In this situation, no tool that goes against your guess would count as a good tool.

      I am just telling the truth, although it might be difficult to ... ;-)
        OK, as you won't give up, pg, here are the rules in question:
        1. Bad behaviour gets you a first warning
        2. If you don't improve after a second warning, you get a two-month suspension
        3. If you attempt to rejoin the community under another name while suspended, no matter how well you behave, you get banned

        I really think these are fair rules. And they're stated upfront.

        But no matter what rules we choose, the facts are these:

        • We suspect someone of lying about who they are.
        • When you suspect someone is lying, asking them "hey, are you lying?" is not a logical way to find out.
        • Linguistic analysis is. And there are great Perl modules for it.

        You should be happy with the outcome anyway pg, because as far as I'm concerned, with the help of Perl, I'm now satisfied that these two people aren't the same. It's like one of those annoying lawyer shows where they prove the guy innocent.
        --
        “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D

      > But, having used the Fathom module, see above, I've got nothing conclusive, I'm afraid. It's a very useful tool but hasn't proven or disproven anything. There are fewer differences between two randomly-chosen posters than between Slinker and Stinker, it turns out.

      > Another angle of attack on this problem, which I hadn't thought of before, is mis-spellings -- Slinker has spelt "happening" as "happenning" twice, but Stinker gets it right every time...

      Leaving aside the issue of whether it is really worth spending a lot of time on this mystery, testing services have dealt with some aspects of your problem, especially in personality tests, where they ask you the same question in many slightly different ways and perform some kind of analysis to determine whether you are trying to spoof the test by appearing to be someone you are not.

      Your mention of a spelling discrepancy brought to mind a scene from The Princess Bride where Westley was to add poison to one of the drinks, and his adversary was to choose, after Westley had shifted (or not) the position of the glasses. The bad guy goes through a series of questions and answers trying to figure out Westley's thoughts -- "You placed the poisoned glass closer to me so I'd choose it. But I'm too smart for that, so it must be the one closest to you... But you knew I'd anticipate that move, so it must be the one closest to me after all." And so on for a few minutes of pretty funny dialogue. (I'm sure I got the details turned around, but you get the gist.)

      Is this guy deliberately misspelling a word or two just to throw you off? Does it really matter? It still boils down to a guess, doesn't it?

      Even after centuries of linguistic analysis, and lately with some fairly sophisticated computer analysis, scholars are still arguing whether Marlowe wrote the works attributed to Shakespeare, or whether Shakespeare was, indeed, Shakespeare.

      -----
      "Computeri non cogitant, ergo non sunt"

        >a scene from The Princess Bride
        Man in black:  (turning his back, and adding the poison to one of the goblets)
        	Alright, where is the poison?  The battle of wits has begun.  It ends
        	when you decide and we both drink - and find out who is right, and who
        	is dead.
        Vizzini:  But it's so simple.  All I have to do is divine it from what I know of
        	you.  Are you the sort of man who would put the poison into his own
        	goblet or his enemy's? Now, a clever man would put the poison into his
        	own goblet because he would know that only a great fool would reach for
        	what he was given.  I am not a great fool so I can clearly not choose
        	the wine in front of you...But you must have known I was not a great
        	fool; you would have counted on it, so I can clearly not choose the wine
        	in front of me.
        Man in black:  You've made your decision then?
        Vizzini:  (happily) Not remotely!  Because Iocaine comes from Australia.  As
        	everyone knows, Australia is entirely peopled with criminals.  And
        	criminals are used to having people not  trust them, as you are not
        	trusted by me.	So, I can clearly not choose the wine in front of you.
        Man in black:  Truly, you have a dizzying intellect.
        Vizzini:  Wait 'till I get going!!  ...where was I?
        Man in black:  Australia.
        Vizzini:  Yes! Australia!  And you must have suspected I would have known the
        	powder's origin,so I can clearly not choose the wine in front of me.
        Man in black:  You're just stalling now.
        Vizzini:  You'd like to think that, wouldn't you!  You've beaten my giant, which
        	means you're exceptionally strong...so you could have put the poison in
        	your own goblet trusting on your strength to save you, so I can clearly
        	not choose the wine in front of you.  But, you've also bested my
        	Spaniard, which means you must have studied...and in studying you must
        	have learned that man is mortal so you would have put the poison as far
        	from yourself as possible, so I can clearly not choose the wine in front
        	of me!
        
        Man in black:  You're trying to trick me into giving away something.  It won't
        	work.
        Vizzini:  It has worked!  You've given everything away! I know where the poison
        	is!
        Man in black:  Then make your choice.
        Vizzini:  I will, and I choose...(pointing behind the man in black) What in the
        	world can that be?
        Man in black:  (turning around, while Vizzini switches goblets) What?! Where?! I
        	don't see anything.
        Vizzini:  Oh, well, I...I could have sworn I saw something. No matter.	(Vizzini
        	laughs)
        Man in black:  What's so funny?
        Vizzini:  I...I'll tell you in a minute.  First, lets drink, me from my glass
        	and you from yours.
        
        (They both drink)
        


        is that the one you meant?

        I still maintain that this was an interesting exercise.

        I did one other thing, which was brute-force but also interesting.

        I grabbed every 2-char string from the posters, put them in a hash with number of occurrences, sorted the results by number, and compared the most popular 1,000 2-char strings from the suspect posters with the most popular 2-char strings from "real" posters. Again, the results were inconclusive.

        Slinker and Stinker shared 75% of the most-popular-strings, but another poster shared 68%, so it wasn't very dramatic evidence either way.
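The 2-char comparison described above can be sketched roughly as follows, in Python rather than the Perl hash that was actually used. The case and whitespace normalisation and the function names are assumptions, and the original may have counted raw characters.

```python
from collections import Counter

def top_bigrams(text, n=1000):
    """Return the set of the n most frequent 2-character strings in text."""
    # Assumption: fold case and collapse runs of whitespace before counting.
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {bigram for bigram, _ in counts.most_common(n)}

def shared_fraction(text_a, text_b, n=1000):
    """Fraction of A's top-n bigrams that also appear in B's top-n."""
    top_a = top_bigrams(text_a, n)
    top_b = top_bigrams(text_b, n)
    return len(top_a & top_b) / len(top_a) if top_a else 0.0
```

Any two writers of ordinary English share most of their common letter pairs, so a high overlap is expected. The figure only means something next to the same figure for unrelated posters, which is why 75% against 68% settles so little.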
        --
        “Every bit of code is either naturally related to the problem at hand, or else it's an accidental side effect of the fact that you happened to solve the problem using a digital computer.” M-J D
