Re^5: Suffix-prefix matching done right


P is for Practical
	PerlMonks

Re^5: Suffix-prefix matching done right

by NERDVANA (Deacon)

on Nov 06, 2021 at 15:47 UTC ( [id://11138508]=note: print w/replies, xml )

Need Help??

in reply to Re^4: Suffix-prefix matching done right
in thread Suffix-prefix matching done right

A correct edit-distance algorithm is going to be at least O(N^2) and maybe even O(N^3) to check all the different lengths and some strings can be 2000 chars. My code above was able to beat linear "wc". Of course the file was probably entirely cached in ram since I have a lot of that on this system. But I would not be surprised if a SSD could deliver data faster than an N^3 could process it.

Comment on Re^5: Suffix-prefix matching done right

Replies are listed 'Best First'.
Re^6: Suffix-prefix matching done right (updated) by LanX (Saint) on Nov 06, 2021 at 16:06 UTC
> at least O(N^2) A disagree, rather at most O(N*log N) for base64 input and a given max distance. My algorithm based on Re: Suffix-prefix matching done right correctly calculated 1million overlaps with up to 6 errors in ~~2 seconds~~ 53 sec , without being particularly tuned for speed. You just have to avoid impossible matches right from the beginning and to bail out with the longest match. Nobody needs all Levenshtein distances for all combinations. Parameters: `my $n_tests = 1_000_000; my $error= 1+ int rand 5; my $A = rand64($n_err + 1 + rand 50); my $B = rand64($n_err + 1 + rand 50); my $prefix = rand64($n_err + 1 + rand 20); > Took: 2.20354223251343 sec` [download] update sorry `> Took: 53.5716695785522` I forgot a bail-out flag. I'll try now strings with 2000 bytes ... update `my $n_tests = 100_000; my $min_str = 2000;# 50; my $max_overlap = 500; #20; > Took: 56.4621770381927` [download] as I said, it's still tuned for correctness not for performance Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^7: Suffix-prefix matching done right (updated) by NERDVANA (Deacon) on Nov 06, 2021 at 18:24 UTC
OP said it could also be deletions and insertions. So the xor trick won't work. Without more information from OP, we can only guess, but I was guessing this was a genetics problem trying to build DNA sequences from fragments, and the mention of base64 was just mock data to test the algorithm. If the data can have random insertions, deletions, and swaps, then you need the full DP algorithm. IF the number of acceptable errors is low, then yes you can short-circuit it and maybe end up close to linear, but then there's stil the added complexity of deciding how much overlap gives the best result.	[reply]
Re^8: Suffix-prefix matching done right (updated) by LanX (Saint) on Nov 06, 2021 at 18:35 UTC
> OP said it could also be deletions and insertions. Did he? Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply]
Re^9: Suffix-prefix matching done right (updated) by NERDVANA (Deacon) on Nov 07, 2021 at 08:02 UTC

In Section Seekers of Perl Wisdom

Log In^?

Domain Nodelet^?

www.com | www.net | www.org

Node Status^?

node history
Node Type: note [id://11138508]
help

Chatterbox^?

How do I use this? • Last hour • Other CB clients

Other Users^?

Others learning in the Monastery: (2)

As of 2024-04-19 01:37 GMT

Sections^?

Information^?

Find Nodes^?

Leftovers^?

Today I Learned

Voting Booth^?

No recent polls found