Hi!

I have a huge pile of crap Excel VBA code, several tens of thousands of lines, to be analysed and ported to another language, essentially to get rid of the Excel files. The code was partly recorded, partly written by an ungifted amateur, over several years.

Indenting is mostly random, and lines are wrapped just as randomly. So I wrote a small script to fix indenting and line wrapping. That script also extracts each function to a separate file.

So far, I extracted about 100 functions, and I guess another 100 are still waiting for me in Excel files I have not yet touched.

Subroutine calls must have been a very strange and hard to understand concept to the author, so instead, he copied blocks of several hundred lines around, resulting in functions of 1000 lines and more AFTER cleaning up indenting and line wraps. Of course, the copied blocks were later changed, but not all copies at the same time and in the same way.

Now my problem is to identify blocks copied from one function to another, after both copies have been modified. Indent changes and wrapping changes should have been eliminated by my script, but smaller changes remain: Spelling errors fixed in only one or two copies, implicit objects suddenly inserted (e.g. ActiveSheet.Cells(...) instead of just Cells(...)), comments added or removed, empty lines added or removed, some code lines commented out in only one copy, and so on.
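To illustrate the kinds of differences listed above, here is a minimal Python sketch of a normalisation pass that could run before any comparison. It is not the clean-up script mentioned earlier; the names `strip_comment` and `normalize` and the specific rules (dropping comments and empty lines, folding case, removing an explicit `ActiveSheet.` prefix) are assumptions made for illustration only.

```python
import re


def strip_comment(line):
    """Cut a VBA line at the first apostrophe outside a string literal."""
    in_string = False
    for i, ch in enumerate(line):
        if ch == '"':
            # A doubled quote ("") inside a string toggles twice, so it is harmless.
            in_string = not in_string
        elif ch == "'" and not in_string:
            return line[:i]
    return line


def normalize(source):
    """Return the code lines of one function with cosmetic differences removed."""
    lines = []
    for raw in source.splitlines():
        line = strip_comment(raw)
        # Collapse whitespace and fold case (VBA identifiers are case-insensitive).
        # Note: this also lowercases string literals, which is fine for comparing.
        line = re.sub(r"\s+", " ", line).strip().lower()
        # Rem-style comments are dropped as well.
        if line == "rem" or line.startswith("rem "):
            continue
        # Assumption: an explicit ActiveSheet. prefix is equivalent to the implicit form.
        line = re.sub(r"\bactivesheet\.", "", line)
        if line:  # skip empty lines
            lines.append(line)
    return lines
```

Lines that are commented out in only one copy simply disappear from that copy, so they still show up as a difference, but a much smaller one than before.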

So, checksums won't work. Blindly running diff on every pair of functions will generate a lot of noise and very little signal. Grouping functions by line count or byte count, then running diff only on functions of similar size, might generate a little less noise. Basically, any function with fewer than about 100 lines is unlikely to contain eroded/evolved copies of other functions. That rules out about 80% of the functions, leaving me with 20 known and probably another 20 unknown functions.
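The size-based pre-filter described above could be sketched roughly as follows in Python, using the standard `difflib` module instead of an external diff. The function names, the `max_size_ratio` cut-off, and the idea of ranking pairs by `SequenceMatcher.ratio()` are assumptions for illustration; `functions` is assumed to map a function name to its list of (already cleaned-up) lines.

```python
from difflib import SequenceMatcher
from itertools import combinations


def candidate_pairs(functions, min_lines=100, max_size_ratio=2.0):
    """Yield name pairs of functions that are long enough and similar in size."""
    big = {name: lines for name, lines in functions.items()
           if len(lines) >= min_lines}
    for (name_a, a), (name_b, b) in combinations(sorted(big.items()), 2):
        small, large = sorted((len(a), len(b)))
        if large <= small * max_size_ratio:
            yield name_a, name_b


def rank_pairs(functions, **filter_options):
    """Score each candidate pair line by line; most similar pairs come first."""
    scored = []
    for name_a, name_b in candidate_pairs(functions, **filter_options):
        matcher = SequenceMatcher(None, functions[name_a], functions[name_b],
                                  autojunk=False)
        scored.append((matcher.ratio(), name_a, name_b))
    return sorted(scored, reverse=True)
```

With 20 to 40 functions left after the filter, that is at most a few hundred pairs, so comparing all of them is cheap. The ratio only says how many lines two functions share overall; it does not by itself locate a copied block inside two otherwise different functions.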

Is there a smarter way to find those modified copies of existing code?

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

In reply to [OT] Finding similar program code by afoken
