
looks like will only identify duplicated but individual lines of code across the scripts

Every approach is this approach :) It's like a search engine

You iterate over your files, and you index each file

To index, you pick a unit (e.g. one word, or three adjacent lines of code)

Generate a list of all units for a file

Normalize each unit. For words you would stem (strip prefixes/suffixes) to find the root; for lines you would remove insignificant whitespace and insignificant commas, normalize quoting characters, and so on
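A rough sketch of the unit and normalization steps, in Python for illustration. The names `normalize_line` and `units` are made up for this example, and the two normalization rules are deliberately crude (they will also touch whitespace and quotes inside string literals), so treat it as a starting point rather than a finished normalizer:

```python
import re

def normalize_line(line):
    """Collapse whitespace runs and unify quote characters (illustrative rules only)."""
    line = re.sub(r"\s+", " ", line.strip())
    return line.replace('"', "'")

def units(text, size=3):
    """Return (start_line_number, unit) pairs for each window of `size` non-blank lines."""
    # Keep the original line number next to each normalized line.
    lines = [(n, normalize_line(l)) for n, l in enumerate(text.splitlines(), 1)]
    # Blank lines carry no information, so drop them before windowing.
    lines = [(n, l) for n, l in lines if l]
    return [
        (lines[i][0], "\n".join(l for _, l in lines[i:i + size]))
        for i in range(len(lines) - size + 1)
    ]
```

With `size=3` a five-line file with one blank line yields two overlapping units, each tagged with the line it starts on.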

Hash each unit (sha1), and store each hash in a database together with the file and position it came from

Then, to find duplication, query the database to find duplicate hashes
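The hash, store, and query steps could look roughly like this. It is a minimal sketch, assuming an in-memory SQLite database, three-line units, and whitespace-only normalization; `windows`, `build_index`, `find_duplicates`, and the `units` table layout are all hypothetical names chosen for the example:

```python
import hashlib
import sqlite3

def windows(text, size=3):
    """Yield (start_line, unit) for each run of `size` adjacent non-blank lines."""
    # Whitespace-only normalization: collapse runs of spaces/tabs in each line.
    lines = [(n, " ".join(l.split())) for n, l in enumerate(text.splitlines(), 1)]
    lines = [(n, l) for n, l in lines if l]
    for i in range(len(lines) - size + 1):
        yield lines[i][0], "\n".join(l for _, l in lines[i:i + size])

def build_index(files, size=3):
    """Index every unit of every file; `files` maps a file name to its source text."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE units (hash TEXT, file TEXT, line INTEGER)")
    for name, text in files.items():
        rows = [
            (hashlib.sha1(unit.encode("utf-8")).hexdigest(), name, line)
            for line, unit in windows(text, size)
        ]
        db.executemany("INSERT INTO units VALUES (?, ?, ?)", rows)
    return db

def find_duplicates(db):
    """Return {hash: [(file, line), ...]} for every hash stored more than once."""
    query = """
        SELECT hash, file, line FROM units
        WHERE hash IN (SELECT hash FROM units GROUP BY hash HAVING COUNT(*) > 1)
        ORDER BY hash, file, line
    """
    dupes = {}
    for h, name, line in db.execute(query):
        dupes.setdefault(h, []).append((name, line))
    return dupes
```

Calling `find_duplicates(build_index({"a.pl": ..., "b.pl": ...}))` gives you, per duplicated hash, the list of places to go look at. Overlapping windows mean a long duplicated block shows up as several consecutive hits, which you would merge in a later pass.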

This is not unlike what git (git gc) does, so I wouldn't be surprised if git provided a tool to help you visualize these duplications, although I don't know of one

It goes without saying that before making code changes, you need a comprehensive test suite :)
