in reply to Re: The (futile?) quest for an automatic paraphrase engine
in thread The (futile?) quest for an automatic paraphrase engine

Hey dude, where's the code?

Re: Re: Re: The (futile?) quest for an automatic paraphrase engine
by rje (Deacon) on May 18, 2004 at 14:51 UTC
    Frankly, I'm embarrassed, because I'm BFI'ing it instead of doing things properly.

    But here goes. Against my better judgement.
        #
        #  WARNING WARNING WARNING WARNING
        #
        #  USE AT YOUR OWN RISK.
        #
        #  THIS IS A MASSIVE KLUDGE.
        #
        #  YOU HAVE BEEN WARNED.
        #
        my $in = <DATA>;

        # ASSUME sentences end in a period and a space.
        my @sentences = split '\. ', $in;

        foreach( @sentences )
        {
           # ASSUME these words are mostly useless
           # for our purposes...
           s/\b(with|a|of|the|in|just)\b//gi;

           # ASSUME phrases are comma-separated.
           my @phrases  = split ',';
           my @subjects = ();
           my @descs    = ();

           foreach ( @phrases )
           {
              s/^\s*//; # trim leading spaces.
              s/\n//g;  # remove newline.

              # Well, do we have a subject, or a descriptor?
              # ASSUME subjects are capitalized (!!)
              push @subjects, $_ if /^[A-Z]/;

              # ASSUME descriptions are not.
              push @descs, $_ unless /^[A-Z]/;
           }

           # Print 'em all out.
           foreach my $subj ( @subjects )
           {
              my @subsub = ($subj);

              # ASSUME 'and' separates multiple subjects (!!)
              @subsub = split ' and ', $subj if $subj =~ /\band\b/;

              foreach my $ss (@subsub)
              {
                 print "$ss: $_\n" foreach @descs;
              }
           }
        }

        __DATA__
        With a population of more than 10.2 million, Seoul, the capital of South Korea, is the world's largest city in terms of population. Sao Paulo(Brazil), the world's second-largest city, has a population of just over ten million. Three other cities, Bombay(India), Jakarta(Indonesia) and Karachi(Pakistan), have grown to more than nine million people.
    The output:
        Seoul: population more than 10.2 million
        Seoul: capital South Korea
        Seoul: is world's largest city terms population
        Sao Paulo(Brazil): world's second-largest city
        Sao Paulo(Brazil): has population over ten million
        Three other cities: have grown to more than nine million people.
        Bombay(India): have grown to more than nine million people.
        Jakarta(Indonesia): have grown to more than nine million people.
        Karachi(Pakistan): have grown to more than nine million people.

      It's nice how you put up the *warning siren!!* on your assumptions ... Although some might criticize the assumptions, taken in isolation, as overly simplistic (even the OP??), I bet something like this could actually work as the beginnings of a very flexible tool. It would be a matter of building up a 'catalogue' of such assumptions, making them user-configurable (e.g. applying only a certain subset based on the input text specimen), and giving the user the opportunity to add custom assumptions. Moreover, this kind of model is relatively straightforward to understand, with a low barrier to entry. ... this one got the wheels turning hmmm ...
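
To make the 'catalogue' idea a bit more concrete, here is a rough sketch of how it might look, and nothing more than that. Every name in it (%catalogue, apply_assumptions, and the rule names) is made up for illustration and does not come from rje's code. Each assumption is a named code ref, the caller picks which subset to apply and in what order, and a custom assumption is just one more hash entry.

```perl
use strict;
use warnings;

# Hypothetical catalogue of assumptions: name => code ref.
# These names are illustrative only, not from the original post.
my %catalogue = (
    # ASSUME these words are mostly useless for our purposes.
    drop_filler_words => sub {
        my ($text) = @_;
        $text =~ s/\b(?:with|a|of|the|in|just)\b\s*//gi;
        return $text;
    },
    # ASSUME leading and trailing whitespace carries no meaning.
    trim_whitespace => sub {
        my ($text) = @_;
        $text =~ s/^\s+|\s+$//g;
        return $text;
    },
);

# Apply the named assumptions, in the order given, to one piece of text.
sub apply_assumptions {
    my ($text, @names) = @_;
    for my $name (@names) {
        my $rule = $catalogue{$name}
            or die "Unknown assumption: $name\n";
        $text = $rule->($text);
    }
    return $text;
}

# A user-supplied custom assumption is just another entry.
# ASSUME parenthesised text is a qualifier we can drop.
$catalogue{strip_parens} = sub {
    my ($text) = @_;
    $text =~ s/\([^)]*\)//g;
    return $text;
};

print apply_assumptions("  the capital of South Korea ",
                        qw(drop_filler_words trim_whitespace)), "\n";
# prints: capital South Korea

print apply_assumptions("Sao Paulo(Brazil)", qw(strip_parens)), "\n";
# prints: Sao Paulo
```

The list of names passed to apply_assumptions is the per-specimen configuration: a different input text could get a different subset, or the same rules in another order, without touching the rules themselves.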