http://qs1969.pair.com?node_id=11136135


in reply to [OT] Finding similar program code

An n-gram comparison may be helpful. You could keep the data as external files and expose them through a foreign table (via file_fdw), or read them into a regular table. Tens of thousands of lines doesn't sound too large: you could slurp all the code into a postgres table (one line per record) and use the n-gram comparison machinery (see the pg_trgm module [1] in the fine manual). That module works with trigrams and provides, among other things, a 'similarity' function that might be useful here, for instance to compare the lines that you have already identified and 'extracted' against all the others, hopefully finding the still 'hidden' ones. (There's even n-gram indexing, i.e. fast search, although that doesn't seem really necessary at this size.)
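To give a feel for what a trigram 'similarity' score measures, here is a rough sketch in Python rather than SQL. It only approximates what pg_trgm does, as I understand it: lowercase the text, split it into words on non-alphanumeric characters, pad each word with two blanks in front and one behind, collect the three-character substrings, and divide the number of shared trigrams by the number of distinct trigrams overall. The function names (trigrams, similarity, most_similar) and the 0.3 threshold are my own choices for illustration, and the ASCII-only word split is a simplification.

```python
import re


def trigrams(text):
    """Return the set of trigrams of text, roughly the way pg_trgm builds them."""
    grams = set()
    # Lowercase, then treat every run of non-alphanumeric characters as a word
    # break (ASCII only here, which is a simplification).
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two blanks in front, one behind
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def most_similar(known_line, all_lines, threshold=0.3):
    """Score every line against an already identified one, best matches first.

    The threshold is an arbitrary illustrative cut-off.
    """
    hits = [(similarity(known_line, line), line) for line in all_lines]
    return sorted((hit for hit in hits if hit[0] >= threshold), reverse=True)
```

Note that punctuation is thrown away in this scheme, so two lines of code that differ only in operators or whitespace score as identical. That may or may not be what you want when hunting for near-duplicate code.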

(Postgres also has a module called 'fuzzystrmatch' [2], which contains several string comparison functions, for instance Levenshtein distance. But I've always had more luck with the n-gram stuff.)

[1] pg_trgm module - postgresql manual

[2] fuzzystrmatch module - postgresql manual

Edit: another, similar example of postgres n-gram comparison:

Re: String Comparison & Equivalence Challenge