Theory time: Sentence equivalence

by BUU (Prior)
on Dec 17, 2005 at 11:17 UTC ( [id://517465] )

BUU has asked for the wisdom of the Perl Monks concerning the following question:

Another pet project of mine has recently branched into trying to determine whether sentences are "equivalent", that is, asking or referring to the same thing despite different grammatical structures and so forth. Obviously this is a Very Hard Task and I'm not really looking to solve it perfectly. That would be awesome, but it's rather unlikely.

Anyway, my general idea at the moment involves stripping stop words, stemming the remainder and comparing the resulting set with the target set from the other sentence. This will probably be fairly fast for the small case, and it's easy to think of and implement, but it has some flaws. For example, how do I store the resulting sets so I can easily reference them again? If I've got hundreds of thousands of these sets and I want to find the ones that match a new set I just created, how do I do that?
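For concreteness, a minimal sketch of that pipeline. The crude suffix-stripping below is only a stand-in for a real stemmer such as Lingua::Stem, and the stop list is just a token sample; overlap is measured as intersection over union.

    use strict;
    use warnings;

    my %STOP = map { $_ => 1 }
        qw(a an and are as at be by for from in is it of on or the this to was);

    # Reduce a sentence to a set (hash) of stems with stop words removed.
    sub signature {
        my ($sentence) = @_;
        my %set;
        for my $word (split /\W+/, lc $sentence) {
            next if $word eq '' or $STOP{$word};
            $word =~ s/(?:ing|ed|es|s)$//;   # toy stemmer, stand-in only
            $set{$word} = 1;
        }
        return \%set;
    }

    # Overlap between two signatures: size of intersection over size of union.
    sub similarity {
        my ($x, $y) = @_;
        my %union  = (%$x, %$y);
        my $common = grep { $y->{$_} } keys %$x;
        return keys(%union) ? $common / keys(%union) : 0;
    }

    my $s1 = signature("Man bites dog in the park");
    my $s2 = signature("A dog was bitten by a man in the park");
    printf "overlap: %.2f\n", similarity($s1, $s2);   # 0.60 with this toy stemmer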

So that's my idea. Anyone else have any useful ideas? Pointers to research? Clever algorithms?

Replies are listed 'Best First'.
Re: Theory time: Sentence equivalence
by Ovid (Cardinal) on Dec 17, 2005 at 16:49 UTC

    While the name doesn't suggest it, you might find Lingua::LinkParser beneficial. It not only parses sentences, but also identifies each word's role in the sentence -- that is, how the words link together. Thus, you'll know a sentence about the "dog days of summer" isn't about dogs. You can read a detailed intro here.

    The module (the core of which integrates with C code that you install separately) generates data structures that break a sentence down into component parts, though there's a bit of work you'll have to do to wade through all the types it lists. Here's a short example (the module has docs for all of those link types):

    use Lingua::LinkParser;
    use strict;

    my $parser   = new Lingua::LinkParser;             # create the parser
    my $text     = "Moses supposes his toses are roses.";
    my $sentence = $parser->create_sentence($text);    # parse the sentence
    my $linkage  = $sentence->linkage(1);              # use the first linkage
    print $parser->get_diagram($linkage);              # print it out

    __END__
        +-------------------------Xp------------------------+
        |           +------Ce------+                        |
        +---Wd--+---Ss---+         +--Dmc--+--Spx--+--Opt-+ |
        |       |        |         |       |       |      | |
    LEFT-WALL Moses supposes.v his toses[!].n are.v roses.n .

    Cheers,
    Ovid

    New address of my CGI Course.

Re: Theory time: Sentence equivalence
by BrowserUk (Patriarch) on Dec 17, 2005 at 15:04 UTC

    You may think that by asking a limited question, "Are these sentences equivalent", you are avoiding most of the problems that come with natural language processing.

    That simple question admits only black-and-white answers, which requires 100% certainty, and that implies a very great level of understanding. For example:

    • This is the thing.
    • This isn't the thing.
    • The thing is this.
    • Isn't this the thing?

    Four sentences of just four words each and only two characters separating them, but four completely distinct meanings. And you could substitute any noun for 'thing', and the sentences remain equally distinct.

    I do not think that there is any statistical measure or weighting algorithm that will allow you to determine their equivalence or lack thereof, nor even give a probability of equivalence that would be useful in making that determination.
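    To make that concrete, here is a small demonstration, assuming the kind of stop-word/stemming signature sketched in the root node: all four sentences collapse to the same bag of words, so no set comparison can tell them apart.

        use strict;
        use warnings;

        my %STOP = map { $_ => 1 } qw(this is isn t the);

        # Reduce a sentence to a sorted bag of non-stop words.
        sub bag {
            my %set;
            $set{$_} = 1 for grep { $_ ne '' && !$STOP{$_} } split /\W+/, lc shift;
            return join '|', sort keys %set;
        }

        print bag($_), "\n" for
            "This is the thing.",
            "This isn't the thing.",
            "The thing is this.",
            "Isn't this the thing?";
        # prints "thing" four times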


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Theory time: Sentence equivalence
by Not_a_Number (Prior) on Dec 17, 2005 at 21:32 UTC
    Pointers to research?

    At one level, this is what research in computational linguistics has been working on for years. Before attempting to invent (note that I didn't say reinvent :) this particular wheel by yourself, try googling for "constraint based grammars", "machine translation", "semantic equivalence" or whatever.

    Moreover, I think you have a misconception about what constitutes "equivalence". The non-trivial issues of word order and punctuation have been pointed out above. A further problem is that sentences which not only have different grammatical structures but also contain very different words can still be "equivalent". For example:

    - In New York, following the latest Fed rate cut, stocks rose across the board.

    - The Federal Bank's further lowering of base rates boosted the NYSE and the NASDAQ.

    - Wall Street reacted positively after Greenspan reduced interest rates again.

    may be considered strictly equivalent, at least for some definition of equivalence, despite containing very few words in common.

    Anyway, best of luck in your project. I, for one, would be very interested in seeing your results.

    dave

    Update: Fixed typos, changed sample sentences slightly to make them more "equivalent".


      I don't think you realize what you've gotten into. Research into natural language processing (the official term) is at least 40 years old. Browse on citeulike for NLP, NLG (natural language generation), and NLU (...understanding). Also search Google Scholar.

      The topic is extremely theoretical and relies heavily on Chomsky's theories of grammar, as well as tree theory. There are two main approaches: McKeown's (a name you will see a lot) is usually based on nested templates (IIRC) and requires a massive database of grammatical structures. Other approaches rely on deep semantic representations of the knowledge that is implied by grammatical structures.

      A couple of things that you have failed to consider, each of which has massive amounts of discussion in the literature:
      - Referring expressions (him, her, that, etc.)
      - Focus (what's the subject? Is the sentence really _about_ the subject?)
      - homonyms (semantic homonyms, at least)
      - fifty billion other things...

      The dismal state of machine translation ought to indicate how far we have yet to go. Babelfish is rather state of the art, actually.

      Good luck.

      ~e
        On mondays, I think linguists -- Chomsky, Pinker, the lot of them -- are pseudoscientists peddling bumhug. Kind of like certain bad apple social scientists and continental philosophers -- see The Sokal Affair.

        On tuesdays, I think maybe linguists are more like physicists than the wizards Sokal pulled back the curtain on.

        Enh, who knows. But a good place to start on the bad news of solving linguistics problems with computing is Pinker's The Language Instinct. Bumhug he may be, on mondays, but I liked the book a lot anyway :)

        Babelfish is rather state of the art, actually
        ...mmmm not really. Babelfish/Systran may be the biggest fish in the pond commercially, but they are hardly state of the art. The big splashes, technically, are being made by people looking at statistical techniques, such as Language Weaver (the company I work for), ISI, and Google.

        As for the original topic. I will just echo some of the other posters and warn you that you are taking some tiny first (mis?)steps on a journey whose destination is a long ways off. This is an intensely complex and interesting topic you could spend your entire life on if you are sufficiently interested/motivated/compensated.

        --DrWhy

        "If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."

Re: Theory time: Sentence equivalence
by jZed (Prior) on Dec 17, 2005 at 16:10 UTC
    Don't forget to deal with word-order: Man bites dog. ... Dog bites man.
Re: Theory time: Sentence equivalence
by CountZero (Bishop) on Dec 17, 2005 at 20:07 UTC
    And punctuation:

    But father says: "John you are sitting on it".
    "But father," says John, "you are sitting on it".

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Theory time: Sentence equivalence
by TedPride (Priest) on Dec 17, 2005 at 14:37 UTC
    Large sets are actually easier to use, from a searching standpoint. As each sentence is entered, identify what part of speech each word is (I assume you'll be doing that already). Then store, for each word/part-of-speech pair, a count along with a list of the sentences that word belongs to.

    I LOVE bread and butter.
    LOVE is beautiful.

    Love in one sentence is a verb and in the other a subject. The two should be kept separate.

    As your sample grows, you should be able to get fairly accurate matches by adding up the weights for each word/part-of-speech pair multiplied by the number of times the word appears in the sentence. You only need to look at sentences that contain key words and have a match percentage over a certain level, which means your heavy-duty algorithm will probably never need to process more than a few dozen candidates, even with hundreds of thousands of sentences stored.
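    A rough sketch of that index, assuming the part-of-speech tags come from some tagger upstream (the tags below are supplied by hand, purely for illustration):

        use strict;
        use warnings;

        my %index;    # "word/POS" => { sentence_id => count }

        # @tagged is a list of [ word, POS ] pairs for one sentence.
        sub add_sentence {
            my ($id, @tagged) = @_;
            for my $pair (@tagged) {
                my $key = lc($pair->[0]) . '/' . $pair->[1];
                $index{$key}{$id}++;
            }
        }

        # Score only the sentences that share at least one word/POS key.
        sub candidates {
            my (@tagged) = @_;
            my %score;
            for my $pair (@tagged) {
                my $key   = lc($pair->[0]) . '/' . $pair->[1];
                my $posts = $index{$key} or next;
                $score{$_} += $posts->{$_} for keys %$posts;
            }
            return sort { $score{$b} <=> $score{$a} } keys %score;
        }

        add_sentence(1, [ i => 'PRP' ], [ love => 'VB' ], [ bread => 'NN' ],
                        [ and => 'CC' ], [ butter => 'NN' ]);
        add_sentence(2, [ love => 'NN' ], [ is => 'VBZ' ], [ beautiful => 'JJ' ]);

        # "love" as a verb pulls in sentence 1 only, not sentence 2.
        print join(', ', candidates([ love => 'VB' ], [ toast => 'NN' ])), "\n";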

Re: Theory time: Sentence equivalence
by eweaver (Sexton) on Dec 18, 2005 at 07:37 UTC

    In terms of the technical implementation, why don't you concatenate the stems together in alphabetical order (perhaps with a delimiter) and use that as a hash key pointing to the matching sentences? That should scale pretty well and be fast to search.
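    A minimal sketch of that idea, again with a toy stemmer standing in for a real one: the sorted, delimited stems become a hash key, so looking up exact matches is a single hash probe no matter how many sentences are stored.

        use strict;
        use warnings;

        # Normalise a sentence to its sorted stems joined with a delimiter.
        sub canonical_key {
            my ($sentence) = @_;
            my %set;
            for my $word (split /\W+/, lc $sentence) {
                next if $word eq '' or $word =~ /^(?:a|an|and|the|is|are|of|in|on|to)$/;
                $word =~ s/(?:ing|ed|es|s)$//;    # toy stemmer, stand-in only
                $set{$word} = 1;
            }
            return join '|', sort keys %set;
        }

        my %by_key;    # canonical key => list of sentences sharing it
        for my $s ("The dog bites the man", "Man bites dog", "Roses are red") {
            push @{ $by_key{ canonical_key($s) } }, $s;
        }

        # One hash lookup finds every stored sentence with the same key.
        my $matches = $by_key{ canonical_key("A dog bites a man") } || [];
        print "$_\n" for @$matches;    # note: both word orders come back, as jZed warned above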

    ~e
Re: Theory time: Sentence equivalence
by dimar (Curate) on Dec 20, 2005 at 15:29 UTC

    In addition to the excellent posts in this thread, there are the corollary responses in The (futile?) quest for an automatic paraphrase engine. Eventually, someone is going to review all the musings on this Very Hard Question, write a book on it, and solve it.

    Or not.

