Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Re^5: Suffix-prefix matching done right

by NERDVANA (Deacon)
on Nov 06, 2021 at 15:47 UTC ( [id://11138508]=note: print w/replies, xml ) Need Help??


in reply to Re^4: Suffix-prefix matching done right
in thread Suffix-prefix matching done right

A correct edit-distance algorithm is going to be at least O(N^2) and maybe even O(N^3) to check all the different lengths and some strings can be 2000 chars. My code above was able to beat linear "wc". Of course the file was probably entirely cached in ram since I have a lot of that on this system. But I would not be surprised if a SSD could deliver data faster than an N^3 could process it.
  • Comment on Re^5: Suffix-prefix matching done right

Replies are listed 'Best First'.
Re^6: Suffix-prefix matching done right (updated)
by LanX (Saint) on Nov 06, 2021 at 16:06 UTC
    > at least O(N^2)

    A disagree, rather at most O(N*log N) for base64 input and a given max distance.

    My algorithm based on Re: Suffix-prefix matching done right correctly calculated 1million overlaps with up to 6 errors in 2 seconds 53 sec , without being particularly tuned for speed.

    You just have to avoid impossible matches right from the beginning and to bail out with the longest match.

    Nobody needs all Levenshtein distances for all combinations.

    Parameters:

    my $n_tests = 1_000_000; my $error= 1+ int rand 5; my $A = rand64($n_err + 1 + rand 50); my $B = rand64($n_err + 1 + rand 50); my $prefix = rand64($n_err + 1 + rand 20); > Took: 2.20354223251343 sec

    update
    sorry

    > Took: 53.5716695785522

    I forgot a bail-out flag.

    I'll try now strings with 2000 bytes ...

    update
    my $n_tests = 100_000; my $min_str = 2000;# 50; my $max_overlap = 500; #20; > Took: 56.4621770381927

    as I said, it's still tuned for correctness not for performance

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      OP said it could also be deletions and insertions. So the xor trick won't work. Without more information from OP, we can only guess, but I was guessing this was a genetics problem trying to build DNA sequences from fragments, and the mention of base64 was just mock data to test the algorithm. If the data can have random insertions, deletions, and swaps, then you need the full DP algorithm. *IF* the number of acceptable errors is low, then yes you can short-circuit it and maybe end up close to linear, but then there's stil the added complexity of deciding how much overlap gives the best result.
        > OP said it could also be deletions and insertions.

        Did he?

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11138508]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others learning in the Monastery: (2)
As of 2024-04-19 01:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found