Re: The (futile?) quest for an automatic paraphrase engine

Well, if you can guarantee that the specimen encapsulates all of the grammar rules you're likely to find, and the sentences themselves consist of the only patterns you're going to find, then you can brute-force out some Perl that's not too painful. Woodenly using your input sample as THE pattern, a 30-line script can blindly cobble together this kind of output (not perfect, but close):

Seoul: population  more than 10.2 million
Seoul: capital  South Korea
Seoul: is  world's largest city  terms  population.

Sao Paulo(Brazil): world's second-largest city
Sao Paulo(Brazil): has  population   over ten million.

Three other cities: have grown to more than nine million people.

Bombay(India): have grown to more than nine million people.

Jakarta(Indonesia) and Karachi(Pakistan): have grown to more than nine million people.

Re: Re: The (futile?) quest for an automatic paraphrase engine by dimar (Curate) on May 17, 2004 at 23:14 UTC
Hey dude, where's the code?
Re: Re: Re: The (futile?) quest for an automatic paraphrase engine by rje (Deacon) on May 18, 2004 at 14:51 UTC
Frankly, I'm embarrassed, because I'm BFI'ing it, instead of doing things properly. But here goes. Against my better judgement.

#
# WARNING WARNING WARNING WARNING
#
# USE AT YOUR OWN RISK.
#
# THIS IS A MASSIVE KLUDGE.
#
# YOU HAVE BEEN WARNED.
#

# Reads ONE line only: the text under __DATA__ must stay on a single line.
my $in = <DATA>;

# ASSUME sentences end in a period and a space.
my @sentences = split '\. ', $in;

foreach ( @sentences )
{
   # ASSUME these words are mostly useless
   # for our purposes...
   s/\b(with|a|of|the|in|just)\b//gi;

   # ASSUME phrases are comma-separated.
   my @phrases  = split ',';
   my @subjects = ();
   my @descs    = ();

   foreach ( @phrases )
   {
      s/^\s*//;   # trim leading spaces.
      s/\n//g;    # remove newline.

      # Well, do we have a subject, or a descriptor?
      # ASSUME subjects are capitalized (!!)
      push @subjects, $_ if /^[A-Z]/;

      # ASSUME descriptions are not.
      push @descs, $_ unless /^[A-Z]/;
   }

   # Print 'em all out.
   foreach my $subj ( @subjects )
   {
      my @subsub = ($subj);

      # ASSUME 'and' separates multiple subjects (!!)
      @subsub = split ' and ', $subj if $subj =~ /\band\b/;

      foreach my $ss (@subsub)
      {
         print "$ss: $_\n" foreach @descs;
      }
   }
}

__DATA__
With a population of more than 10.2 million, Seoul, the capital of South Korea, is the world's largest city in terms of population. Sao Paulo(Brazil), the world's second-largest city, has a population of just over ten million. Three other cities, Bombay(India), Jakarta(Indonesia) and Karachi(Pakistan), have grown to more than nine million people.

The output:

Seoul: population more than 10.2 million
Seoul: capital South Korea
Seoul: is world's largest city terms population
Sao Paulo(Brazil): world's second-largest city
Sao Paulo(Brazil): has population over ten million
Three other cities: have grown to more than nine million people.
Bombay(India): have grown to more than nine million people.
Jakarta(Indonesia): have grown to more than nine million people.
Karachi(Pakistan): have grown to more than nine million people.
Re: Re: Re: Re: The (futile?) quest for an automatic paraphrase engine by Anonymous Monk on May 19, 2004 at 02:12 UTC
It's nice how you put up the warning siren (!!) on your assumptions ... In isolation, some might criticize the assumptions as overly simplistic (even the OP??), but I bet something like this could actually work as the beginnings of a very flexible tool. It would be a matter of building up a 'catalogue' of such assumptions, making them user-configurable (e.g. applying only a certain subset based on the input text specimen), and giving the user the opportunity to add custom assumptions. Moreover, this kind of model is relatively straightforward to understand, with a low barrier to entry and a gentle learning curve. ... this one got the wheels turning hmmm ...
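For what it's worth, the 'catalogue of assumptions' idea might be sketched roughly as below. This is only an illustration and not rje's script: the rule names (split_sentences, strip_filler, split_phrases, is_subject, split_subjects) and the paraphrase() function are invented here, and each rule is just one of the ASSUME comments from the kludge turned into a replaceable code ref. Unlike the original, the filler rule also swallows the whitespace after each dropped word, and the sentence rule accepts a period at the very end of the text.

```perl
use strict;
use warnings;

# Hypothetical catalogue: one named, swappable rule per assumption.
my %assumptions = (
    # ASSUME sentences end in a period followed by whitespace (or end of text).
    split_sentences => sub { split /\.(?:\s+|\z)/, $_[0] },

    # ASSUME these words are mostly useless for our purposes.
    strip_filler => sub {
        my $s = shift;
        $s =~ s/\b(?:with|a|of|the|in|just)\b\s*//gi;
        return $s;
    },

    # ASSUME phrases are comma-separated.
    split_phrases => sub {
        my @p = split /,/, $_[0];
        s/^\s+|\s+$//g for @p;          # trim both ends of every phrase
        return grep { length } @p;      # drop phrases that ended up empty
    },

    # ASSUME subjects are capitalized and descriptions are not.
    is_subject => sub { $_[0] =~ /^[A-Z]/ },

    # ASSUME 'and' separates multiple subjects.
    split_subjects => sub { split /\s+and\s+/, $_[0] },
);

# Apply the catalogue to a text. Any rule passed by name in %override
# replaces the default of the same name for this call only.
sub paraphrase {
    my ($text, %override) = @_;
    my %rule = (%assumptions, %override);
    my @out;

    for my $sentence ( $rule{split_sentences}->($text) ) {
        my @phrases  = $rule{split_phrases}->( $rule{strip_filler}->($sentence) );
        my @subjects = grep {  $rule{is_subject}->($_) } @phrases;
        my @descs    = grep { !$rule{is_subject}->($_) } @phrases;

        for my $subj ( map { $rule{split_subjects}->($_) } @subjects ) {
            push @out, "$subj: $_" for @descs;
        }
    }
    return @out;
}

my $text = "Sao Paulo(Brazil), the world's second-largest city, "
         . "has a population of just over ten million.";

print "$_\n" for paraphrase($text);
# Sao Paulo(Brazil): world's second-largest city
# Sao Paulo(Brazil): has population over ten million

# A user-supplied rule: keep 'X and Y' together as a single subject.
print "$_\n" for paraphrase(
    "Three other cities, Jakarta(Indonesia) and Karachi(Pakistan), have grown.",
    split_subjects => sub { $_[0] },
);
```

Applying only a subset of assumptions then comes down to overriding the unwanted ones with a pass-through sub, as the last call does for split_subjects. Whether that scales beyond toy input is a separate question; every rule here is still exactly as naive as the ASSUME comment it came from.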