Looks like that will only identify duplicated individual lines of code across the scripts, not larger duplicated blocks.
Every approach is this approach :) It's like a search engine.
You iterate over your files, and you index each file.
To index, you pick a unit (e.g. one word, or three adjacent lines of code).
Generate a list of all units for a file
Normalize each unit. For words, you would stem (remove prefixes/suffixes) to find the root; for lines, you would remove insignificant whitespace and commas, normalize quoting characters, and so on.
Hash each unit (e.g. SHA-1), and store each hash with its unit and source file in a database.
Then, to find duplication, query the database to find duplicate hashes
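Here is a rough sketch of the steps above in Python, just to make the idea concrete. Everything in it is my own assumption rather than an existing tool: the unit is three adjacent non-blank lines, normalization only collapses whitespace and unifies quote characters, and an in-memory dict stands in for the database. The names `normalize`, `units`, `build_index` and `find_duplicates` are made up for illustration.

```python
import hashlib
import re
from collections import defaultdict

WINDOW = 3  # assumed unit size: three adjacent non-blank lines


def normalize(line):
    """Collapse runs of whitespace and treat double quotes as single quotes."""
    line = re.sub(r"\s+", " ", line.strip())
    return line.replace('"', "'")


def units(text, window=WINDOW):
    """Yield (starting line number, normalized unit) for each sliding window."""
    lines = [(n, normalize(raw)) for n, raw in enumerate(text.splitlines(), start=1)]
    lines = [(n, line) for n, line in lines if line]  # skip blank lines
    for i in range(len(lines) - window + 1):
        chunk = lines[i:i + window]
        yield chunk[0][0], "\n".join(line for _, line in chunk)


def build_index(files):
    """Map each unit's SHA-1 hash to the (file name, line number) places it occurs.

    `files` is a dict of file name -> file contents; a real setup would read
    from disk and write to an actual database instead of a dict.
    """
    index = defaultdict(list)
    for name, text in files.items():
        for lineno, unit in units(text):
            digest = hashlib.sha1(unit.encode("utf-8")).hexdigest()
            index[digest].append((name, lineno))
    return index


def find_duplicates(index):
    """Keep only the hashes that were seen in more than one place."""
    return {digest: places for digest, places in index.items() if len(places) > 1}


if __name__ == "__main__":
    files = {
        "a.sh": 'echo  "hi"\nls -l\ncd /tmp\nrm x\n',
        "b.sh": "pwd\necho 'hi'\nls   -l\ncd /tmp\n",
    }
    for digest, places in find_duplicates(build_index(files)).items():
        print(digest[:8], places)
```

A bigger window gives fewer but more meaningful hits, and more aggressive normalization (renaming variables, stripping comments) catches near-duplicates at the cost of more false positives.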
This is not unlike what git (git gc) does, so I wouldn't be surprised if git provided a tool to help you visualize these duplications, although I don't know of one.
It goes without saying that before making code changes, you need a comprehensive test suite :)