Re: general advice finding duplicate code

Thanks for the responses so far. I'll look up the clone doctor code... however, I cannot send this codebase to 3rd parties.
The second approach, using dumper, looks like it will only identify duplicated but individual lines of code across the scripts... which would be just as easy to do using
cat *.php | sort | uniq -c
I'll keep thinking about it too, and will post any gems. A brute-force reducing sliding window between two scripts is possible, but would probably blow out to hours/days of running time for the 40 or so script pair combinations.
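
A minimal sketch of that reducing-window comparison for one pair of scripts, written in Python purely for illustration. The function name `longest_shared_run` and the `min_size` parameter are hypothetical, and lines are compared exactly as given, with no normalization. Looking windows up in a dict rather than comparing every window against every other may keep the running time more manageable, though that is untested on a real codebase.

```python
def longest_shared_run(lines_a, lines_b, min_size=2):
    """Return (size, start_a, start_b) for the longest run of identical
    consecutive lines shared by two scripts, or None if no run of at
    least min_size lines is shared. Start positions are 0-based."""
    max_size = min(len(lines_a), len(lines_b))
    # Start with the largest possible window and shrink it one line at a time.
    for size in range(max_size, min_size - 1, -1):
        # Record the first position of every window of this size in script A.
        seen = {}
        for i in range(len(lines_a) - size + 1):
            seen.setdefault(tuple(lines_a[i:i + size]), i)
        # Slide the same-sized window over script B and look each one up.
        for j in range(len(lines_b) - size + 1):
            key = tuple(lines_b[j:j + size])
            if key in seen:
                return size, seen[key], j
    return None


a = ["x", "a", "b", "c", "y"]
b = ["q", "a", "b", "c", "z", "a"]
result = longest_shared_run(a, b)
```

This only reports the single longest shared run; listing every shared run would need the matched lines masked out and the search repeated.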

the hardest line to type correctly is: stty erase ^H

Replies are listed 'Best First'.

Re^2: general advice finding duplicate code
by Anonymous Monk on Jun 21, 2011 at 06:49 UTC

looks like will only identify duplicated but individual lines of code across the scripts

Every approach is this approach :) It's like a search engine

You iterate over your files, and you index each file

To index, you pick a unit (e.g. one word, or three adjacent lines of code)

Generate a list of all units for a file

Normalize each unit. For words you would stem (remove prefix/suffix..) to find the root; for lines you would remove insignificant whitespace and insignificant commas, normalize quoting characters...

Hash each unit (sha1), and associate all this in a database

Then, to find duplication, query the database to find duplicate hashes
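
The steps above can be sketched roughly as follows, in Python for illustration only. This is not a finished tool: the unit is assumed to be three adjacent non-blank lines, the normalization rule is deliberately crude (collapse whitespace, treat double quotes as single quotes), and a plain dict stands in for the database. The names `normalize`, `index_file` and `find_duplicates` are made up for this example.

```python
import hashlib
import re
from collections import defaultdict

WINDOW = 3  # assumed unit size: three adjacent lines of code


def normalize(line):
    """Collapse runs of whitespace and unify quote characters (crude on purpose)."""
    return re.sub(r"\s+", " ", line.strip()).replace('"', "'")


def index_file(name, text, index, window=WINDOW):
    """Hash every window of `window` adjacent non-blank lines into `index`."""
    lines = [(no, normalize(raw)) for no, raw in enumerate(text.splitlines(), 1)]
    lines = [(no, line) for no, line in lines if line]  # drop blank lines
    for i in range(len(lines) - window + 1):
        unit = "\n".join(line for _, line in lines[i:i + window])
        digest = hashlib.sha1(unit.encode("utf-8")).hexdigest()
        # Remember where the unit starts: (file name, 1-based line number).
        index[digest].append((name, lines[i][0]))


def find_duplicates(index):
    """Keep only the hashes that were seen at more than one location."""
    return {digest: locs for digest, locs in index.items() if len(locs) > 1}


index = defaultdict(list)  # stand-in for the database
index_file("a.php", "$x = 1;\n$y  =  2;\n\n$z = \"q\";\n", index)
index_file("b.php", "echo 1;\n$x = 1;\n$y = 2;\n$z = 'q';\n", index)
dups = find_duplicates(index)
```

Here `dups` maps one hash to the two places the normalized three-line unit starts: line 1 of `a.php` and line 2 of `b.php`. For a real run you would read each script from disk and probably swap the dict for something persistent.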

This is not unlike what git (git gc) does, so I wouldn't be surprised if git provided a tool to help you visualize these duplications, although I don't know of one

It goes without saying that before making code changes, you need a comprehensive test suite :)
