go ahead... be a heretic | |
PerlMonks |
comment on |
( [id://3333]=superdoc: print w/replies, xml ) | Need Help?? |
n-gram comparison may be helpful. You could keep data as external files via a foreign table (via file_fdw). Or read them into a regular table: tens of thousands of lines doesn't sound too large: you could slurp all code into a postgres table (a line a record) and use the n-gram comparison machinery (see module pg_trgm [1] in the fine manual). That module works with trigrams and it gives (amongst others) a 'similarity' function that might be useful, for instance comparing similarity of the lines that you already identified and have 'extracted', to all others, hopefully finding the still 'hidden' ones. (there's even n-gram indexing (i.e. fast search) although that seems not really necessary) (postgres also has a module called 'fuzzystrmatch' [2] which contains several string comparison functions, for instance Levenshtein. But I've always had more luck with the n-gram stuff.) [1] pg_trgm module - postgresql manual [2] fuzzystrmatch module - postgresql manual Edit: A different/similar example with postgres n-gram comparison: Re: String Comparison & Equivalence Challenge In reply to Re: [OT] Finding similar program code
by erix
|
|