Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Re^4: Suffix-prefix matching done right

by LanX (Saint)
on Nov 06, 2021 at 10:12 UTC ( [id://11138496]=note: print w/replies, xml ) Need Help??


in reply to Re^3: Suffix-prefix matching done right
in thread Suffix-prefix matching done right

I seriously doubt parallelization is the solution here.

Reading the data from disk will take far more time than processing the data with a clever algorithm.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

  • Comment on Re^4: Suffix-prefix matching done right

Replies are listed 'Best First'.
Re^5: Suffix-prefix matching done right
by NERDVANA (Deacon) on Nov 06, 2021 at 15:47 UTC
    A correct edit-distance algorithm is going to be at least O(N^2) and maybe even O(N^3) to check all the different lengths and some strings can be 2000 chars. My code above was able to beat linear "wc". Of course the file was probably entirely cached in ram since I have a lot of that on this system. But I would not be surprised if a SSD could deliver data faster than an N^3 could process it.
      > at least O(N^2)

      A disagree, rather at most O(N*log N) for base64 input and a given max distance.

      My algorithm based on Re: Suffix-prefix matching done right correctly calculated 1million overlaps with up to 6 errors in 2 seconds 53 sec , without being particularly tuned for speed.

      You just have to avoid impossible matches right from the beginning and to bail out with the longest match.

      Nobody needs all Levenshtein distances for all combinations.

      Parameters:

      my $n_tests = 1_000_000; my $error= 1+ int rand 5; my $A = rand64($n_err + 1 + rand 50); my $B = rand64($n_err + 1 + rand 50); my $prefix = rand64($n_err + 1 + rand 20); > Took: 2.20354223251343 sec

      update
      sorry

      > Took: 53.5716695785522

      I forgot a bail-out flag.

      I'll try now strings with 2000 bytes ...

      update
      my $n_tests = 100_000; my $min_str = 2000;# 50; my $max_overlap = 500; #20; > Took: 56.4621770381927

      as I said, it's still tuned for correctness not for performance

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        OP said it could also be deletions and insertions. So the xor trick won't work. Without more information from OP, we can only guess, but I was guessing this was a genetics problem trying to build DNA sequences from fragments, and the mention of base64 was just mock data to test the algorithm. If the data can have random insertions, deletions, and swaps, then you need the full DP algorithm. *IF* the number of acceptable errors is low, then yes you can short-circuit it and maybe end up close to linear, but then there's stil the added complexity of deciding how much overlap gives the best result.
Re^5: Suffix-prefix matching done right
by karlgoethebier (Abbot) on Nov 09, 2021 at 09:27 UTC

    What about a correct algorithm plus parallelization? If this problem is embarrassingly parallel. And I think it is. Or may be it is. I always have doubts. Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

        I haven't seen a parallelizable disk drive yet...

        Here you go ;-)

        Please take a look at mce_loop_f { code } file from MCE::Loop. And didn’t we have a debate here some years ago where marioroy convinced our beloved former leader BrowserUk that some things look different today? I couldn’t find this inspiring thread in a hurry as I’m on a mobile device and in a hurry. Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://11138496]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (3)
As of 2024-04-23 06:57 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found