PerlMonks  

Re: [OT] Finding similar program code

by erix (Prior)
on Aug 27, 2021 at 15:34 UTC ( #11136135=note )


in reply to [OT] Finding similar program code

An n-gram comparison may be helpful. You could keep the data as external files and expose them through a foreign table (via file_fdw), or read them into a regular table. Tens of thousands of lines doesn't sound too large: you could slurp all the code into a postgres table (one line per record) and use the n-gram comparison machinery (see module pg_trgm [1] in the fine manual). That module works with trigrams and provides, amongst others, a 'similarity' function that might be useful, for instance to compare the lines that you have already identified and 'extracted' against all the others, hopefully finding the still 'hidden' ones. (There is even n-gram indexing, i.e. fast search, although that does not seem really necessary here.)
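To make the trigram idea concrete outside the database, here is a minimal Python sketch of roughly what a trigram 'similarity' computes: lowercase the text, split it into words, pad each word with two leading blanks and one trailing blank, collect the three-character substrings, and divide the number of shared trigrams by the number of distinct trigrams overall. This is an approximation for illustration, not pg_trgm's actual implementation; it only handles ASCII letters and digits, and the helper `find_similar` and its `threshold` default are hypothetical names and values made up for this example.

```python
import re


def trigrams(text):
    """Return the set of trigrams of text, approximating pg_trgm's extraction."""
    grams = set()
    # Lowercase, then treat every run of ASCII letters/digits as one word.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading blanks, one trailing blank
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def find_similar(known_lines, all_lines, threshold=0.5):
    """Hypothetical helper: report lines that resemble any already-known line.

    Returns (line_number, best_score, line) tuples; threshold is an
    arbitrary illustrative cut-off, not a recommended value.
    """
    hits = []
    for lineno, line in enumerate(all_lines, start=1):
        best = max((similarity(line, k) for k in known_lines), default=0.0)
        if best >= threshold:
            hits.append((lineno, best, line))
    return hits
```

Feeding the already 'extracted' lines in as `known_lines` and the whole code base as `all_lines` mimics the comparison suggested above. For tens of thousands of lines this brute-force loop may be slow, which is where the database and its indexing would earn their keep.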

(Postgres also has a module called 'fuzzystrmatch' [2], which contains several string comparison functions, for instance Levenshtein. But I've always had more luck with the n-gram stuff.)
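For comparison, the Levenshtein distance mentioned above counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A minimal sketch of the textbook dynamic-programming version, written here in Python purely for illustration (it is not the fuzzystrmatch code):

```python
def levenshtein(a, b):
    """Edit distance between a and b with unit costs for every operation."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]
```

Note that this yields a distance (lower means more alike) rather than a 0-to-1 score, so long lines with small differences and short lines with small differences get the same number.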

[1] pg_trgm module - postgresql manual

[2] fuzzystrmatch module - postgresql manual

Edit: A different/similar example with postgres n-gram comparison:

Re: String Comparison & Equivalence Challenge
