
Hi!

I have a huge pile of ~~crap~~ Excel VBA code, several tens of thousands of lines, that has to be analysed and ported to another language, essentially to get rid of the Excel files. The code was partly recorded, partly written by an ungifted amateur, over several years.

Indenting is mostly random, and lines are wrapped just as randomly. So I wrote a small script to fix indenting and line wrapping. That script also extracts each function to a separate file.

So far, I have extracted about 100 functions, and I guess another 100 are still waiting for me in Excel files I have not yet touched.

Subroutine calls must have been a very strange and hard-to-understand concept to the author, so instead, he copied blocks of several hundred lines around, resulting in functions of 1000 lines and more AFTER cleaning up indenting and line wraps. Of course, the copied blocks were later changed, but not all copies at the same time and in the same way.

Now my problem is to identify blocks copied from one function to another, after both copies have been modified. Indent changes and wrapping changes should have been eliminated by my script, but smaller changes remain: Spelling errors fixed in only one or two copies, implicit objects suddenly inserted (e.g. ActiveSheet.Cells(...) instead of just Cells(...)), comments added or removed, empty lines added or removed, some code lines commented out in only one copy, and so on.
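To make the kinds of changes listed above concrete, here is a minimal sketch of a normalisation pass that would hide most of them before any comparison. It is written in Python purely for illustration and is not the clean-up script mentioned earlier; the function names `strip_comment` and `normalise_vba` are made up, and the rules (drop comments and empty lines, drop an explicit `ActiveSheet.` prefix, ignore case and spacing) are assumptions about what counts as cosmetic.

```python
import re


def strip_comment(line: str) -> str:
    """Remove a trailing VBA comment, leaving string literals alone."""
    if re.match(r"\s*rem\b", line, flags=re.IGNORECASE):
        return ""  # a whole-line Rem comment
    in_string = False
    for i, ch in enumerate(line):
        if ch == '"':
            # A doubled "" inside a literal toggles twice, so the state stays right.
            in_string = not in_string
        elif ch == "'" and not in_string:
            return line[:i]
    return line


def normalise_vba(source: str) -> list[str]:
    """Return the lines of one function with cosmetic differences removed."""
    result = []
    for line in source.splitlines():
        line = strip_comment(line)
        # Assumption: an explicit ActiveSheet. prefix is equivalent to the implicit form.
        line = re.sub(r"\bActiveSheet\.", "", line, flags=re.IGNORECASE)
        # VBA identifiers are case-insensitive; lowercasing also touches string
        # literals, which is acceptable for comparison purposes only.
        line = re.sub(r"\s+", " ", line).strip().lower()
        if line:  # drop empty and comment-only lines
            result.append(line)
    return result
```

Code that was commented out in only one copy simply disappears from that copy here, so it shows up as a deletion rather than as a changed line. Spelling fixes are not covered; they still need a fuzzy comparison afterwards.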

So, checksums won't work. Blindly running diff on every pair of functions will generate a lot of noise and very little signal. Grouping functions by line count or byte count, then running diff only on functions of similar size, might generate a little less noise. Basically, any function with fewer than about 100 lines is unlikely to contain eroded/evolved copies of other functions. That rules out about 80% of the functions, leaving me with 20 known and probably another 20 unknown functions.
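The filtering idea in the previous paragraph can be sketched roughly as follows, again in Python for illustration only. Instead of calling diff, this sketch uses `difflib.SequenceMatcher` from the standard library as a stand-in and counts how many lines two functions have in common, in order. The function name `shared_blocks`, the 100-line cut-off as a parameter, and the `min_shared` threshold are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def shared_blocks(functions, min_lines=100, min_shared=50):
    """Rank pairs of large functions by the number of lines they share.

    functions: dict mapping a function name to its list of source lines
    (ideally with comments and empty lines already stripped).
    """
    # Short functions are unlikely to contain copied blocks, so skip them.
    big = {name: lines for name, lines in functions.items()
           if len(lines) >= min_lines}
    pairs = []
    for (a, lines_a), (b, lines_b) in combinations(big.items(), 2):
        matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
        # Sum the sizes of all matching runs of identical lines.
        shared = sum(block.size for block in matcher.get_matching_blocks())
        if shared >= min_shared:
            pairs.append((shared, a, b))
    # Pairs with the most shared lines come first.
    return sorted(pairs, reverse=True)
```

With about 40 functions this is fewer than 800 comparisons, so the quadratic cost of each comparison should not matter much. Counting shared lines rather than using the overall similarity ratio keeps a 300-line copied block visible even when both functions are otherwise very different.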

Is there a smarter way to find those modified copies of existing code?

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to [OT] Finding similar program code by afoken

	PerlMonks