Re: Is it possible to find the matching words and the percentage of matching words between two texts?

by rovf (Priest)
on Dec 21, 2012 at 09:29 UTC


in reply to Is it possible to find the matching words and the percentage of matching words between two texts?

How do you want the following cases to be dealt with?

Case 1:

$a="a b c d e f"; $b="f e d c b a";

Case 2:

$a="a a a a a"; $b="a";
-- 
Ronald Fischer <ynnor@mm.st>

Re^2: Is it possible to find the matching words and the percentage of matching words between two texts?
by supriyoch_2008 (Monk) on Dec 21, 2012 at 10:06 UTC

    Hi rovf

    Thanks for your quick reply. I need case 2. As a teacher, I want to find out to what extent any two students in my class have copied each other's assignments. The majority of the students (out of 30) are sincere and hard-working, but it appears to me that about four of them often plagiarize their written assignments, i.e. they copy from others' assignments without visiting the library or consulting textbooks or research papers. That is why I need a working Perl script that can measure the degree of plagiarism by the students I suspect. This is a very personal case: I just want to tell the students that I am not satisfied with their assignments if I detect more than 80% matching words. I don't know whether a Perl script can solve this problem. I want to make those four students more hard-working, not only in their studies but also in other spheres of life.

    Regards

      What you really need is to align the two texts with a "dynamic programming" algorithm. This is a common task in bioinformatics, but the atomic unit there is a single character, and the alphabet is small (usually 4 nucleotides or 20 amino acids). You would have to hack it a fair bit to work with an array of words drawn from an essentially unlimited "character set", but I haven't looked at the code in detail:

      Bio::Tools::dpAlign
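
To give an idea of what alignment over words (rather than characters) could look like, here is a minimal sketch that does not use Bio::Tools::dpAlign at all. It swaps in a simpler dynamic-programming technique, the longest common subsequence, and the helper names (words_of, lcs_length, aligned_percent) are made up for illustration. Dividing by the longer text is just one possible way to turn the result into a percentage.

```perl
use strict;
use warnings;

# Split a text into lowercase words (hypothetical helper).
sub words_of {
    my ($text) = @_;
    return [ map { lc($_) } ($text =~ /(\w+)/g) ];
}

# Length of the longest common subsequence of two word lists,
# computed row by row with the classic dynamic-programming table.
sub lcs_length {
    my ($x, $y) = @_;
    my @prev = (0) x (@$y + 1);
    for my $i (1 .. @$x) {
        my @curr = (0);
        for my $j (1 .. @$y) {
            $curr[$j] = $x->[$i - 1] eq $y->[$j - 1]
                ? $prev[$j - 1] + 1
                : ($prev[$j] > $curr[$j - 1] ? $prev[$j] : $curr[$j - 1]);
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# In-order matching words as a percentage of the longer text
# (an assumption; other denominators are equally defensible).
sub aligned_percent {
    my ($s, $t) = @_;
    my ($x, $y) = (words_of($s), words_of($t));
    my $longer = @$x > @$y ? @$x : @$y;
    return 0 unless $longer;
    return 100 * lcs_length($x, $y) / $longer;
}

printf "%.1f%%\n", aligned_percent("a b c d e f", "f e d c b a");
printf "%.1f%%\n", aligned_percent("a a a a a", "a");
```

Because word order matters here, rovf's case 1 (the same words reversed) scores low: only one word can be matched in order. A real alignment would also score gaps and near-matches, which this sketch ignores.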

      For a quick and dirty solution, I would extend the hash comparison approach to handle single words, word pairs, triplets and maybe longer runs. It may also be worth searching CPAN further; there might be something else out there.
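
A rough sketch of that hash approach, assuming words are runs of \w characters compared case-insensitively. The function names (ngram_counts, ngram_percent) are hypothetical, and counting a repeated n-gram at most as often as it occurs in the other text is my own choice, not something settled in this thread.

```perl
use strict;
use warnings;

# Count every run of $n consecutive words in $text (hypothetical helper).
sub ngram_counts {
    my ($text, $n) = @_;
    my @w = map { lc($_) } ($text =~ /(\w+)/g);
    my %count;
    for my $i (0 .. @w - $n) {
        $count{ join ' ', @w[$i .. $i + $n - 1] }++;
    }
    return \%count;
}

# Percentage of the n-grams in $s that also occur in $t.
# A repeated n-gram is matched at most as often as it occurs in $t.
sub ngram_percent {
    my ($s, $t, $n) = @_;
    my ($cs, $ct) = (ngram_counts($s, $n), ngram_counts($t, $n));
    my ($total, $shared) = (0, 0);
    for my $g (keys %$cs) {
        my $mine  = $cs->{$g};
        my $other = exists $ct->{$g} ? $ct->{$g} : 0;
        $total  += $mine;
        $shared += $mine < $other ? $mine : $other;
    }
    return $total ? 100 * $shared / $total : 0;
}

for my $n (1, 2) {
    printf "n=%d: %.0f%%\n", $n,
        ngram_percent("a b c d e f", "f e d c b a", $n);
}
```

With single words (n=1), rovf's case 1 comes out as a full match, while with word pairs (n=2) nothing matches, so longer n-grams make the check sensitive to word order. Note that the measure is not symmetric: in case 2, only one of the five words in $a is matched by $b, but the single word in $b is fully matched by $a.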

        Hi uncoolbob,

        Thanks for providing information about Bio::Tools.

        With regards
