PerlMonks  

Re: The (futile?) quest for an automatic paraphrase engine

by rje (Deacon)
on May 17, 2004 at 20:43 UTC


in reply to The (futile?) quest for an automatic paraphrase engine

Well, if you can guarantee that the specimen encapsulates all of the grammar rules you're likely to find, and that its sentences follow the only patterns you're going to find, then you can brute-force some Perl out that's not too painful. Woodenly using your input sample as THE pattern, a 30-line script can blindly cobble together this kind of output (not perfect, but close):
Seoul: population  more than 10.2 million
Seoul: capital  South Korea
Seoul: is  world's largest city  terms  population.

Sao Paulo(Brazil): world's second-largest city
Sao Paulo(Brazil): has  population   over ten million.

Three other cities: have grown to more than nine million people.

Bombay(India): have grown to more than nine million people.

Jakarta(Indonesia) and Karachi(Pakistan): have grown to more than nine million people.
Re: Re: The (futile?) quest for an automatic paraphrase engine
by dimar (Curate) on May 17, 2004 at 23:14 UTC

    Hey dude, where's the code?

      Frankly, I'm embarrassed, because I'm BFI'ing it instead of doing things properly.

      But here goes. Against my better judgement.
      #
      #  WARNING WARNING WARNING WARNING
      #
      #  USE AT YOUR OWN RISK.
      #
      #  THIS IS A MASSIVE KLUDGE.
      #
      #  YOU HAVE BEEN WARNED.
      #
      my $in = <DATA>;

      # ASSUME sentences end in a period and a space.
      my @sentences = split '\. ', $in;

      foreach( @sentences )
      {
         # ASSUME these words are mostly useless
         # for our purposes...
         s/\b(with|a|of|the|in|just)\b//gi;

         # ASSUME phrases are comma-separated.
         my @phrases  = split ',';
         my @subjects = ();
         my @descs    = ();

         foreach ( @phrases )
         {
            s/^\s*//; # trim leading spaces.
            s/\n//g;  # remove newline.

            # Well, do we have a subject, or a descriptor?
            # ASSUME subjects are capitalized (!!)
            push @subjects, $_ if /^[A-Z]/;

            # ASSUME descriptions are not.
            push @descs, $_ unless /^[A-Z]/;
         }

         # Print 'em all out.
         foreach my $subj ( @subjects )
         {
            my @subsub = ($subj);

            # ASSUME 'and' separates multiple subjects (!!)
            @subsub = split ' and ', $subj if $subj =~ /\band\b/;

            foreach my $ss (@subsub)
            {
               print "$ss: $_\n" foreach @descs;
            }
         }
      }

      __DATA__
      With a population of more than 10.2 million, Seoul, the capital of South Korea, is the world's largest city in terms of population. Sao Paulo(Brazil), the world's second-largest city, has a population of just over ten million. Three other cities, Bombay(India), Jakarta(Indonesia) and Karachi(Pakistan), have grown to more than nine million people.
      The output:
      Seoul: population more than 10.2 million
      Seoul: capital South Korea
      Seoul: is world's largest city terms population
      Sao Paulo(Brazil): world's second-largest city
      Sao Paulo(Brazil): has population over ten million
      Three other cities: have grown to more than nine million people.
      Bombay(India): have grown to more than nine million people.
      Jakarta(Indonesia): have grown to more than nine million people.
      Karachi(Pakistan): have grown to more than nine million people.

        It's nice how you put the *warning siren!!* on your assumptions ... Although some might criticize the assumptions, taken in isolation, as overly simplistic (even the OP??), I bet something like this could actually work as the beginnings of a very flexible tool. It would be a matter of building up a 'catalogue' of such assumptions, making them user-configurable (e.g. applying only a certain subset based on the input text specimen), and giving the user the opportunity to add custom assumptions. Moreover, this kind of model is relatively straightforward to understand, with a low barrier to entry. ... this one got the wheels turning hmmm ...
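The 'catalogue' idea might be sketched roughly as follows. This is only one guess at a possible shape, not anything posted in the thread: the rule names (`split_sentences`, `strip_filler`, and so on) and the `paraphrase` function are made up for illustration, and each rule simply wraps one of the ASSUME comments from the script above. Unlike the original, the sketch also trims and collapses whitespace, and it returns the lines instead of printing them.

```perl
use strict;
use warnings;

# Hypothetical catalogue: one named code ref per assumption.
my %catalogue = (
    # ASSUME sentences end in a period and a space.
    split_sentences => sub { my ($text) = @_; return split /\. /, $text },
    # ASSUME these words are mostly useless for our purposes.
    strip_filler    => sub {
        my ($s) = @_;
        $s =~ s/\b(with|a|of|the|in|just)\b//gi;
        return $s;
    },
    # ASSUME phrases are comma-separated.
    split_phrases   => sub { my ($s) = @_; return split /,/, $s },
    # ASSUME subjects are capitalized, and descriptions are not.
    is_subject      => sub { my ($p) = @_; return $p =~ /^[A-Z]/ ? 1 : 0 },
    # ASSUME 'and' separates multiple subjects.
    split_subjects  => sub { my ($s) = @_; return split / and /, $s },
);

# Apply the catalogue to a text. Any name => code ref pairs passed after
# the text replace the default rule of the same name.
sub paraphrase {
    my ($text, %custom) = @_;
    my %rule = (%catalogue, %custom);    # custom entries win

    my @out;
    for my $sentence ( $rule{split_sentences}->($text) ) {
        $sentence = $rule{strip_filler}->($sentence);

        my (@subjects, @descs);
        for my $phrase ( $rule{split_phrases}->($sentence) ) {
            $phrase =~ s/^\s+|\s+$//g;   # trim both ends
            $phrase =~ s/\s+/ /g;        # collapse gaps left by stripped words
            next unless length $phrase;
            if ( $rule{is_subject}->($phrase) ) { push @subjects, $phrase }
            else                                { push @descs,    $phrase }
        }

        for my $subj ( map { $rule{split_subjects}->($_) } @subjects ) {
            push @out, "$subj: $_" for @descs;
        }
    }
    return @out;
}

my $text = "Sao Paulo(Brazil), the world's second-largest city, has a "
         . "population of just over ten million. Three other cities, "
         . "Bombay(India), Jakarta(Indonesia) and Karachi(Pakistan), have "
         . "grown to more than nine million people.";

# Default catalogue.
print "$_\n" for paraphrase($text);

# A user-supplied assumption: keep 'and'-joined subjects together.
print "$_\n" for paraphrase( $text, split_subjects => sub { return $_[0] } );
```

Passing a replacement code ref under an existing name is how a user would switch off or customize a single assumption here; choosing a whole subset per input specimen would need a little more bookkeeping than this sketch shows.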
