cntrtrst has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks. I'm looking for some quasi-semantic wisdom on the following problem. I'd like to compare sentences to each other to see if they're the same, but making allowances for added/missing words or typos.

This maybe isn't a Perl question in the strictest sense, but I'll be using Perl to do it. I've considered various forms of diff, including WordDiff, which is nice but not quite what I'm after. The algorithm I'm considering now does spot checks of substrings at random indices, but that raises hard-to-answer questions about what constitutes an acceptable margin of error, and I'm not sure it will work very well in the wild.

The purpose is to get incoming text streams and compare them to a template to determine if the person is using the template or deviating from the template. In this application, people will be allowed and even encouraged to deviate from the template they're given to write, but I want to be able to determine when that's happening in real time.

One thing that should make the problem easier is that users should be either attempting to copy the template or clearly doing something else. The two behaviors should be quite clearly distinct and, to the eye, would be easily distinguishable. However, a human reader can judge the meaning of the sentence being evaluated and I think that's actually the first line of analysis that informs the rest (such as noticing typos).

Any general thoughts on algorithms to approach this problem with will be appreciated. Thank you very much.

Re: comparing sentences
by Your Mother (Archbishop) on Oct 29, 2016 at 19:30 UTC
Re: comparing sentences
by Albannach (Monsignor) on Oct 29, 2016 at 19:22 UTC
    Interesting one. I've made good use of Levenshtein distance and variants thereof many times (see Text::Levenshtein), and it will work for sentences as well as it works for words (though you may have some speed issues as the length increases - I don't know how long your text samples are going to be). Off the top of my head I'd think about comparing the current user input to the same length initial substring of the template. A result of 0 means they're exactly copying it, and a result equal to the current string length means they're completely different. Maybe if you could prepare some sample data and your desired output we could get closer to an answer.
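
    A minimal sketch of the prefix comparison described above, assuming a hand-rolled edit distance so that it runs with core Perl only (the distance() function from Text::Levenshtein could presumably be dropped in instead). The helper name prefix_score and the sample strings are made up for illustration. The score is normalised by the typed length, so 0 means an exact copy so far and values near 1 mean little in common.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Classic two-row dynamic-programming edit distance
# (insertions, deletions and substitutions each cost 1).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: compare what has been typed so far against the
# template prefix of the same length, normalised by the typed length.
sub prefix_score {
    my ($template, $typed) = @_;
    return 0 unless length $typed;
    my $prefix = substr($template, 0, length $typed);
    return edit_distance($prefix, $typed) / length $typed;
}

# Made-up sample data, not from the original poster.
my $template = 'The quick brown fox jumps over the lazy dog.';
for my $typed ('The quikc brown fox', 'Completely different') {
    printf "%-22s %.2f\n", $typed, prefix_score($template, $typed);
}
```

    Calling prefix_score again each time more input arrives gives a running score; where to put the cut-off between "copying" and "deviating" would still have to come from real samples.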

    --
    I'd like to be able to assign to an luser

Re: comparing sentences
by BrowserUk (Patriarch) on Oct 29, 2016 at 19:35 UTC

    With your description as given, I'd be hard pushed to come up with any approach; or even rule one out.

    I think you would get far better responses if you supplied 3 further pieces of data:

    1. A sample template.

      If the real ones are proprietary, make something up that is consistent with the real thing.

    2. A (complete) sample input where the user is "using the template".

      Along with an indication of how far through that sample you would visually determine that they were.

    3. A (complete) sample input where the user "is deviating from the template".

      Again, an indication of how much of that you would need to see before visually determining that they were not.

    With that, you might get better, more targeted responses.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: comparing sentences
by AnomalousMonk (Archbishop) on Oct 29, 2016 at 19:47 UTC
Re: comparing sentences
by tybalt89 (Monsignor) on Oct 30, 2016 at 16:49 UTC

    Just pick a reasonable value to test $ratio against (like maybe 1.0):

    #!/usr/bin/perl

    # http://perlmonks.org/?node_id=1174957

    use strict;
    use warnings;
    use Algorithm::Diff qw(traverse_sequences);

    my $template = <<END;
    The earliest conceptualization of human rights is credited to ideas about natural rights emanating from natural law. In particular, the issue of universal rights was introduced by the examination of extending rights to indigenous peoples by Spanish clerics, such as Francisco de Vitoria and Bartolomé de Las Casas. In the Valladolid debate, Juan Ginés de Sepúlveda, who maintained an Aristotelian view of humanity as divided into classes of different worth, argued with Las Casas, who argued in favour of equal rights to freedom from slavery for all humans regardless of race or religion.
    END

    my $copywitherrors = <<END;
    The earliest conceptulization of human rights is credited to ideas about rights emanating from natural law. In particular the issue of universal rights was introduced by extending rights to indigenous peoples by Spanish clerics, such as Francisco de Vitoria and Bartolomé de Las Casas. In the Valladolid debate, Juan Ginés de Sepúlveda, who maintained an Aristotelian view of humanity as divided into clases of different worth argued with Las Casas, who argued in favour of equal rights to freedom from slavery for all humans regardless of race.
    END

    my $somethingelse = <<END;
    Although ideas of rights and liberty have existed in some form for much of human history, there is agreement that the earlier conceptions do not closely resemble the modern conceptions of human rights. According to Jack Donnelly, in the ancient world, "traditional societies typically have had elaborate systems of duties... conceptions of justice, political legitimacy, and human flourishing that sought to realize human dignity, flourishing, or well-being entirely independent of human rights.
    These institutions and practices are alternative to, rather than different formulations of, human rights".14 The history of human rights can be traced to past documents, particularly Constitution of Medina (622), Al-Risalah al-Huquq (659-713), Magna Carta (1215), the Twelve Articles of Memmingen (1525), the English Bill of Rights (1689), the French Declaration of the Rights of Man and of the Citizen (1789), and the Bill of Rights in the United States Constitution (1791)
    END

    sub compare {
        my ($old, $new) = @_;
        my $changes = 0;
        traverse_sequences(
            [ split //, $old ],
            [ split //, $new ],
            {
                DISCARD_A => sub { $changes++ },
                DISCARD_B => sub { $changes++ },
            }
        );
        return $changes;
    }

    my $length = length $template;
    print "length: $length\n";

    my $answer = compare($template, $copywitherrors);
    my $ratio  = $answer / $length;
    printf "copywitherrors: %d ratio %.3f\n", $answer, $ratio;

    $answer = compare($template, $somethingelse);
    $ratio  = $answer / $length;
    printf "somethingelse: %d ratio %.3f\n", $answer, $ratio;
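
    The $ratio test suggested above could be wrapped up roughly as follows. This is only a sketch: it counts changes as (length of template + length of input - 2 x longest common subsequence), which should match what the two DISCARD callbacks count, but it uses a small hand-written LCS so that it runs without Algorithm::Diff. The names lcs_length, change_ratio and classify are made up, and the threshold of 1.0 is just the value floated above. Because deletions and insertions are both counted, the ratio can exceed 1.0 for unrelated text.

```perl
use strict;
use warnings;

# Length of the longest common subsequence of two array refs (two-row DP).
sub lcs_length {
    my ($x, $y) = @_;
    my @prev = (0) x (@$y + 1);
    for my $i (0 .. $#$x) {
        my @cur = (0);
        for my $j (0 .. $#$y) {
            if ($x->[$i] eq $y->[$j]) {
                push @cur, $prev[$j] + 1;
            }
            else {
                push @cur, $prev[$j + 1] > $cur[$j] ? $prev[$j + 1] : $cur[$j];
            }
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: characters dropped from the template plus characters
# added by the input, divided by the template length.
sub change_ratio {
    my ($template, $input) = @_;
    my @t = split //, $template;
    my @u = split //, $input;
    return 0 unless @t;
    my $changes = scalar(@t) + scalar(@u) - 2 * lcs_length(\@t, \@u);
    return $changes / scalar(@t);
}

# Hypothetical helper: the threshold is an assumption to be tuned on real data.
sub classify {
    my ($template, $input, $threshold) = @_;
    $threshold = 1.0 unless defined $threshold;
    return change_ratio($template, $input) < $threshold ? 'template' : 'deviating';
}

# Made-up sample data.
my $template = 'the cat sat on the mat';
for my $input ('the cat sat on teh mat', 'zzzz zzzz zzzz zzzz zzzz') {
    printf "%-26s %.3f %s\n", $input, change_ratio($template, $input),
        classify($template, $input);
}
```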
Re: comparing sentences
by kcott (Archbishop) on Oct 30, 2016 at 14:10 UTC

    G'day cntrtrst,

    "The purpose is to get incoming text streams and compare them to a template ..."

    Would this template lend itself to conversion to a regular expression? If so, your comparison might just be "$input =~ /$template_regex/".
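
    As a rough illustration of what such a conversion might look like, the sketch below builds a regex from a plain-text template. The helper name template_to_regex and the sample strings are made up. It tolerates only differences in case and whitespace, not typos or added/missing words, so on its own it would probably be too strict for the stated problem.

```perl
use strict;
use warnings;

# Hypothetical helper: escape each word of the template and allow any
# run of whitespace between words, ignoring case.
sub template_to_regex {
    my ($template) = @_;
    my @words   = map { quotemeta($_) } split ' ', $template;
    my $pattern = join '\s+', @words;
    return qr/^\s*$pattern\s*$/i;
}

# Made-up sample data.
my $template_regex = template_to_regex('The quick brown fox');
for my $input ('the  quick brown   fox', 'The quick red fox') {
    print "$input: ", ($input =~ $template_regex ? 'match' : 'no match'), "\n";
}
```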

    [Note: My reply is purely a guess at a solution. As others have already said, better information from you will result in better answers from us. See "How do I post a question effectively?".]

    — Ken

Re: comparing sentences
by cntrtrst (Initiate) on Oct 30, 2016 at 05:41 UTC
    oops. should have replied instead of commenting.