Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

[OT] Finding similar program code

by afoken (Canon)
on Aug 27, 2021 at 11:10 UTC ( #11136125=perlquestion: print w/replies, xml ) Need Help??

afoken has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

I have a huge pile of crap Excel VBA code, several 10_000 lines, to be analysed and to be ported to another language, essentially to get rid of the Excel files. The code was partly recorded, partly written by an ungifted amateur, over several years.

Indenting is mostly random, lines are wrapped as randomly as the indenting. So I wrote a small script to fix indenting and line wrapping. That script also extracts each function to a separate file.

So far, I extracted about 100 functions, and I guess another 100 are still waiting for me in Excel files I have not yet touched.

Subroutine calls must have been a very strange and hard to understand concept to the author, so instead, he copied blocks of several hundred lines around, resulting in functions of 1000 lines and more AFTER cleaning up indenting and line wraps. Of course, the copied blocks were later changed, but not all copies at the same time and in the same way.

Now my problem is to identify blocks copied from one function to another, after both copies have been modified. Indent changes and wrapping changes should have been eliminated by my script, but smaller changes remain: Spelling errors fixed in only one or two copies, implicit objects suddenly inserted (e.g. ActiveSheet.Cells(...) instead of just Cells(...)), comments added or removed, empty lines added or removed, some code lines commented out in only one copy, and so on.

So, checksums won't work. Blindly running diff on any pair of functions will generate a lot of noise and only little signal. Grouping functions by line count or byte count, then running diff only on functions with similar sizes might generate a little bit less noise. Basically, any function having less than about 100 lines is unlikely to contain eroded/evolved copies of other functions. That sorts out about 80% of the functions, leaving me with 20 known and probably another 20 unknown functions.

Is there a smarter way to find those modified copies of existing code?

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Replies are listed 'Best First'.
Re: [OT] Finding similar program code
by hippo (Bishop) on Aug 27, 2021 at 11:33 UTC
    but smaller changes remain

    If small enough, perhaps something like Text::Levenshtein might help to flag up possible candidates.

    Is there a smarter way

    I can tell from your posts here that you are experienced enough to know this, so for the benefit of other readers I'll just mention that sometimes the smartest approach is to burn the entire edifice to the ground, let its ashes be dispersed by the four winds and start afresh.


    🦛

      I can tell from your posts here that you are experienced enough to know this, so for the benefit of other readers I'll just mention that sometimes the smartest approach is to burn the entire edifice to the ground, let its ashes be dispersed by the four winds and start afresh.

      That is exactly our intention (no refactoring in VBA), but the mathematics hidden in the Excel files has to be reproduced exactly.

      Some background (as much as I can tell without risking a lot of trouble): Measurement results from several days of running a device on a test rig under varying conditions are written to CSV files, then read into the first Excel file, processed, saved, passed along to next N Excel files, and finally results in a set of calibration parameters that will be written back to the device. The various Excel files could generates a lot of printouts that are already disabled in the code, because nobody cares. But whatever replacement we will build, it must generate exactly the same calibration parameters given the same measurement results. It must be bug-compatible with whatever happens in the Excel files.

      And, you may have guessed it: The entire mess is strictly confidential. I can't show the VBA code, I can't explain more details.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        If your definition of "exactly the same calibration parameters" does not allow for differences due to the quirks of optimization and floating-point arithmetic, your project is probably doomed. Now is the time to start work on a validation procedure. How else might you convince your employer that the job is done?
        Bill
        «…The entire mess is strictly confidential…»

        Any thread deserves a song.

        «The Crux of the Biscuit is the Apostrophe»

Re: [OT] Finding similar program code
by erix (Prior) on Aug 27, 2021 at 15:34 UTC

    n-gram comparison may be helpful. You could keep data as external files via a foreign table (via file_fdw). Or read them into a regular table: tens of thousands of lines doesn't sound too large: you could slurp all code into a postgres table (a line a record) and use the n-gram comparison machinery (see module pg_trgm [1] in the fine manual). That module works with trigrams and it gives (amongst others) a 'similarity' function that might be useful, for instance comparing similarity of the lines that you already identified and have 'extracted', to all others, hopefully finding the still 'hidden' ones. (there's even n-gram indexing (i.e. fast search) although that seems not really necessary)

    (postgres also has a module called 'fuzzystrmatch' [2] which contains several string comparison functions, for instance Levenshtein. But I've always had more luck with the n-gram stuff.)

    [1] pg_trgm module - postgresql manual

    [2] fuzzystrmatch module - postgresql manual

    Edit: A different/similar example with postgres n-gram comparison:

    Re: String Comparison & Equivalence Challenge

Re: [OT] Finding similar program code
by davies (Prior) on Aug 27, 2021 at 11:42 UTC

    ElasticSearch will give you results with scores to indicate how well search terms match. There is a containerised image on Docker hub. The current version (they don't support 'latest') is 7.14.

    Regards,

    John Davies

Re: [OT] Finding similar program code
by bliako (Monsignor) on Aug 27, 2021 at 12:35 UTC
Re: [OT] Finding similar program code
by Anonymous Monk on Aug 27, 2021 at 11:25 UTC
      Thanks, I wanted to point to just these two older threads! :)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re: [OT] Finding similar program code
by afoken (Canon) on Sep 09, 2021 at 16:50 UTC

    First, thanks for all replies.

    It seems I'm lucky. My guess was completely wrong. I had not one, but two piles of Excel files, and I only looked at the first one, assuming the other one would be of similar size. The second pile had exactly one Excel file that did not contain a single line of VBA.

    And my solution to the first pile was good old wetware for detecting repeated fragments, Notepad++ for brute-force searches for code fragments that seemed to be unique to the respective fragments, TortoiseMerge for comparing similar functions, and of course Strawberry Perl to run the perl script that cleans up and splits the VBA code copied out of Excel.

    It is not the most elegant way, but in the end, it took less time than learning and setting up one of the proposed solutions.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: [OT] Finding similar program code
by karlgoethebier (Abbot) on Aug 29, 2021 at 07:35 UTC
    «… VBA code, several 10_000 lines, to be analysed and to be ported to another language, essentially to get rid of the Excel files…»

    Coincidentally I inherited two such similar projects some years ago. Project #1: About some hundreds of VBA macros grown over many years and spread over several departments in the customers company. We failed extracting and porting them and hence we decided to do a total rewrite of all the stuff in Perl, PHP, Flex and Transact SQL. 5 years of hard work. OK, several releases etc. Project #2: Similar situation. After I explained my experiences with project #1 to the customer he gave up. Respectively he switched to SAP. Which is the same thing. Good luck and best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

Re: [OT] Finding similar program code
by jdporter (Chancellor) on Aug 30, 2021 at 15:13 UTC

    I'm faced with a similar problem in a project I'm working on, except it's C# code, and the copypasta code isn't neatly encapsulated in functions. @Previous_Programmer c&p'd large-ish chunks of code around within one big function, which is now around 5,000 lines in length. :-(

      I feel your pain. So far, almost all functions in the mess I inherited are click handlers for buttons or other elements, the remaining few are called from one or the other click handlers. And the functions I've analyzed so far had no more than about 3000 lines after automatic indent and wrap cleanup.

      I know C-style copypasta code from previous experience (see "You are too academic" in "Your code sucks"), where I had the luck that it was finally made somebody else's problem. Crappy code is one thing, but it becomes really nasty when documentation, editor and compiler don't agree about the language specification, and compiler and runtime have bugs that are triggered about daily in the production systems - like silently NOT executing SQL queries.

      Looking back at "too academic", I should be quite happy with VBA. Yes, it is a messy basic, with clear origins in the 16 bit world (for loops become faster when you declare the loop ariable as integer, which limits the loop variable to 16 bit signed integers even on 64 bit systems), some default behaviours are plain weird, and the two variants of if-then-else (block form and line form, one with endif, one without) are annoying for my stupid little cleanup script. But at least it is a much cleaner language than that Hunchback-Gollum-Salvatore C derivate, and it does not prevent you from checking for errors.

      Alexander

      --
      Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        that Hunchback-Gollum-Salvatore C derivate

        wait - do you mean C#? Because imho C# is a great language.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11136125]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (4)
As of 2022-05-25 19:58 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Do you prefer to work remotely?



    Results (90 votes). Check out past polls.

    Notices?