Suffix-prefix matching done right

by baxy77bax (Deacon)
on Nov 04, 2021 at 16:21 UTC

baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a matching problem that I would like your input on. It does not need to be code input; I am more interested in how you would approach it. What I have is a set of paired strings such that sometimes a suffix of string A is a prefix of string B. When I say sometimes, I mean that this is not always the case. Moreover, the suffix and the prefix need not match perfectly: up to X mismatches are allowed. Therefore, when a suffix and prefix do not really match, the longest reported overlap will simply be X characters long (and such a match is of no value). To illustrate the problem better, consider the following example:

A: abbaba        B: babbaaaa

A: abbaba
B: --babbaaaa
Given X=1, the suffix baba of A is the longest prefix of B (with a single mismatch). In order to find such matches I constructed a quite naive algorithm to solve it:
#!/usr/bin/perl
use strict;

srand(4);
for (1..10){
    &find_a_sufpref_match($_, &generate_random_string(4), &generate_random_string(4), 1);
}

sub find_a_sufpref_match {
    my ($id, $a, $b, $msm) = @_;
    my @a = split("", $a);
    my @b = split("", $b);
    my ($start, $length, $m) = (0, 0, 0);
    for (my $i = @a; $i >= 0; $i--){
        my ($x, $mx, $tl, $j) = ($i, 0, 0, 0);
        while( $j <= @a - $i ){
            $mx++ if $a[$x] ne $b[$j];
            $tl++;
            if ($mx > $msm){
                if ($j < @a - $i){
                    $tl = 0;
                }
                $mx--;
                last;
            }
            $x++;
            $j++;
        }
        ($start, $length, $m) = ($i, $tl - 1, $mx) if ($tl - 1 > $length);
    }
    print "$id:($m)($start,$length)($a,$b)\n";
}

sub generate_random_string{
    my $length = shift;    # the length of the random string to generate
    my @chars = qw(a b);
    my $random_string;
    foreach (1..$length){
        $random_string .= $chars[rand @chars];
    }
    return $random_string;
}
Which outputs:
1:(1)(0,4)(bbab,bbbb)
2:(1)(0,4)(aaaa,aaba)
3:(1)(1,3)(baaa,abaa)
4:(1)(2,2)(aabb,bbab)
5:(1)(0,4)(abab,abaa)
6:(1)(0,4)(aabb,babb)
7:(1)(3,1)(aaaa,bbaa)
8:(1)(3,1)(baaa,bbba)
9:(1)(0,4)(aaab,baab)
10:(0)(0,4)(aaaa,aaaa)
Maybe the code has a bug in it, but currently I do not see one... The point is to compare the strings working inwards, increasing the size of the compared suffix/prefix on each pass. Clearly this algorithm is O(|A| x |B|), and if the strings are random one could expect the runtime to be closer to linear, since the X cutoff is expected to be hit quickly. In reality I am expecting to process 700-900 million pairs, each of size 20-2000 characters, delivered in batches (files) of 10 million pairs (that means I have a context of 10 million pairs that I currently see no value in, because the pairs are supposed to be unrelated, aside from the fact that I can process the batches in parallel).

Does anyone know a smarter (faster) way to solve this matching problem?

I also considered a KMP implementation, but I am not sure whether the preprocessing is worth the time, and I am not quite sure how to handle the growing pattern length and the allowed mismatches.
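For the exact-match case (X=0) the textbook trick would be something like the sketch below: compute the KMP failure function of B, a separator, then A, and read off the longest suffix of A that is also a prefix of B in O(|A|+|B|). Folding the allowed mismatches into this is exactly the part I don't see how to do.

#!/usr/bin/perl
use strict;
use warnings;

# Exact-match (X=0) case only: longest suffix of $A that is a prefix of $B,
# via the KMP failure function of $B . separator . $A.
sub exact_overlap_len {
    my ( $A, $B ) = @_;
    my $s = $B . "\x00" . $A;      # separator must not occur in either string
    my @f = (0) x length $s;       # failure function (longest proper border)
    my $k = 0;
    for my $i ( 1 .. length($s) - 1 ) {
        $k = $f[ $k - 1 ]
            while $k > 0 && substr( $s, $i, 1 ) ne substr( $s, $k, 1 );
        $k++ if substr( $s, $i, 1 ) eq substr( $s, $k, 1 );
        $f[$i] = $k;
    }
    return $f[-1];                 # border of the whole string = overlap length
}

print exact_overlap_len( 'abbaba', 'babbaaaa' ), "\n";   # 2 ("ba")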

Anyway, thank you for any (or no) input provided :)

Replies are listed 'Best First'.
Re: Suffix-prefix matching done right
by tybalt89 (Monsignor) on Nov 04, 2021 at 21:22 UTC

    smarter ?? faster - for that you'll have to benchmark...

    #!/usr/bin/perl
    use strict; # https://perlmonks.org/?node_id=11138432
    use warnings;

    sub randomstring { join '', map +(qw(a b))[rand 2], 1 .. shift }

    for (1 .. 10) {
      match( $_, randomstring(4), randomstring(4), 1 );
    }

    sub match {
      my ($n, $x, $y, $m) = @_;
      $x =~ /(.+)$(??{ ($1 ^ substr $y, 0, length $1) =~ tr!\0!!c <= $m ? '' : '(*FAIL)' })/
        and print "$n:($-[1],@{[ length $1 ]})($x,$y)\n";
    }

    Outputs:

    1:(1,3)(abba,baaa)
    2:(3,1)(bbbb,aaba)
    3:(1,3)(abba,abab)
    4:(1,3)(abbb,bbaa)
    5:(2,2)(baba,bbbb)
    6:(1,3)(bbab,aabb)
    7:(0,4)(aaaa,aaab)
    8:(2,2)(baba,baab)
    9:(2,2)(aaba,baaa)
    10:(1,3)(bbab,bbba)

    I hope I understood your problem correctly...

    Update: fixed $_ to $n
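    In case the (??{ ... }) block looks cryptic: as I read it, (.+)$ proposes ever-shorter suffixes of $x (greedy, so the longest acceptable one wins), and the embedded code xors each candidate against the same-length prefix of $y, counting mismatched bytes with tr. A standalone sketch of just that check (variable names here are illustrative):

    # String xor leaves "\0" wherever two equal-length strings agree and a
    # non-"\0" byte wherever they differ; tr/\0//c counts the non-"\0" bytes.
    my ( $suffix, $prefix ) = ( 'baba', 'babb' );   # the OP's example overlap
    my $mismatches = ( $suffix ^ $prefix ) =~ tr/\0//c;
    print "$mismatches\n";   # 1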

Re: Suffix-prefix matching done right
by vr (Curate) on Nov 04, 2021 at 23:18 UTC

    Without addressing parallelization or doing any benchmarks, here's a naive implementation. Hopefully, fuzzy matching implemented and optimized in C is fast. Maybe re-writing this with String::Approx instead of re::engine::TRE (a subroutine call instead of Perl regexp engine overhead) would be faster. Re-visiting fuzzy string matching was fun :). The solution below was more readable/clear before I tried (perceived, untested) optimizations such as dropping blocks, reversing the loop, and remembering that taking a reference to a substr result is efficient (no idea if that holds in this case), etc., but here it is FWIW.

    use strict;
    use warnings;
    use feature 'say';

    use constant ERR => 1;

    use re::engine::TRE
        max_cost => ERR,
        cost_ins => -1,    # no insertions
        cost_del => -1,    # no deletions
        ;

    my $suffix_source = 'abbaba';
    my $prefix_source = 'babbaaaa';

    my $max_len = 0;

    $prefix_source =~ /^${ \substr $suffix_source, -$_ }/
        and $max_len = $_
        and last
            for reverse ERR + 1 .. length $suffix_source;

    say substr( $suffix_source, -$max_len ), ' ',
        substr( $prefix_source, 0, $max_len )
            if $max_len;

    __END__
    baba babb

    Update: crude benchmarks (~200-character strings, ~10 errors allowed) reveal that, for this task, both modules I mentioned are hugely (some tens of times) slower than the "classic" Perl implementation by tybalt89 :)

Re: Suffix-prefix matching done right
by LanX (Saint) on Nov 04, 2021 at 18:27 UTC
    FWIW: you might be able to use Bitwise String Operators to your benefit (applied to different sub-strings of $A, counting the resulting ` characters afterwards with tr).

    But I have trouble matching the documentation with the behavior of 5.32:

    DB<32> p "$A $B" abbaba babbaaaa DB<33> no feature 'bitwise'; print 'abbaba' & 'babbaaaa' ``b``a DB<34> no feature 'bitwise'; print $A & $B 0 DB<35> no feature 'bitwise'; print "$A" & "$B" ``b``a DB<36> print "$A" & "$B" 0 DB<37>

    It looks like feature 'bitwise' is now the default, without this being explicitly explained in perlop.
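    For what it's worth, my reading of perlfeature is that the 'bitwise' feature is part of the v5.28 feature bundle, which makes & | ^ numeric even on strings; the string-wise forms are then spelled &. |. ^. ~. . A minimal sketch of counting mismatches with the dotted operator (this is my interpretation, not an explanation of the debugger output above):

    use v5.28;    # enables feature 'bitwise': plain & | ^ are numeric-only
    my $A = 'abbaba';
    my $B = 'babbaaaa';
    # ^. is the explicit string-wise xor: equal bytes give "\0", so counting
    # the non-"\0" bytes gives the number of mismatching positions
    my $mismatches = ( substr( $A, -4 ) ^. substr( $B, 0, 4 ) ) =~ tr/\0//c;
    say $mismatches;    # 1  ("baba" vs "babb")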

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: Suffix-prefix matching done right
by karlgoethebier (Abbot) on Nov 04, 2021 at 17:45 UTC
    «…process them in parallel…»

    Probably not the worst idea. Spontaneously, mce_loop_f from MCE::Loop comes to mind. For a trivial example see Re^3: Hash Search is VERY slow. Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»
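    A minimal sketch of what mce_loop_f usage could look like here, assuming one tab-separated pair of strings per line in a batch file (the file name, separator, and worker count are made up for illustration):

    use strict;
    use warnings;
    use MCE::Loop;

    MCE::Loop::init { max_workers => 4, chunk_size => 'auto' };

    # each worker receives a chunk of lines from the batch file
    mce_loop_f {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        for my $line ( @$chunk_ref ) {
            chomp $line;
            my ( $a, $b ) = split /\t/, $line;
            # ... run the suffix/prefix matcher on ($a, $b) here ...
        }
    } 'pairs_batch_01.tsv';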

Re: Suffix-prefix matching done right
by LanX (Saint) on Nov 04, 2021 at 18:03 UTC
    Are you only comparing pairwise, or do you need to compare all combinations of two sets of strings?

    I could think of a regex which is efficient, but the overhead to compile it has to be considered.

    edit

    furthermore, is the alphabet restricted to a and b ?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      I am only comparing pairwise and the alphabet is a bit larger, Base64. I restricted it to the above binary form just to make it simpler.
Re: Suffix-prefix matching done right
by NERDVANA (Deacon) on Nov 05, 2021 at 11:18 UTC
    Find an algorithm you like for comparing edit-distance of two strings, and then just loop until it crosses a threshold or you run out of string.

    use Text::WagnerFischer "distance";
    use List::Util 'min';

    my $a= "abbaba";
    my $b= "ababbba";
    my $limit= 3;
    my $n= 0;
    my $max= min(length $a, length $b);
    ++$n while $n < $max
        && distance(substr($a,-$n-1), substr($b,0,$n+1)) < $limit;
    if ($n) {
        printf "'%s' matches '%s' with edit distance %d",
            substr($a,-$n), substr($b,0,$n),
            distance(substr($a,-$n), substr($b,0,$n))
    }

    To parallelize for extremely long matches, you could have N threads step $n += N.

    Edit: sorry, I skimmed the second half of your question. For batches of 10 million pairs of smallish strings, I think you want to use C or C++. If you need to integrate it with Perl code, you can use Inline::C, or just IPC::Run to run a pool of C++ workers which you could then pipe data into and collect the responses from. Note that edit distances are a harder problem than your example code can solve: having solved it for N characters, adding one more character could reach its optimal edit distance through a completely different set of changes to the string. But Wagner-Fischer is O(N^2) each time you run it, so maybe you can look for a variation that calculates "sideways" for each new character rather than starting from scratch each time.

    Edit Again: as anonymous points out, a better edit distance might come after a worse edit distance, so actually you would need to iterate over the whole range of substrings.
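    A sketch of that corrected loop, using the same Text::WagnerFischer distance as above and simply keeping the longest overlap whose distance stays within the limit (the "longest wins" tie-break is my assumption; the OP never gave a ranking rule):

    use strict;
    use warnings;
    use Text::WagnerFischer "distance";
    use List::Util 'min';

    my ( $a, $b, $limit ) = ( 'abbaba', 'babbaaaa', 1 );

    my $best_len = 0;
    for my $len ( 1 .. min( length $a, length $b ) ) {
        # compare the length-$len suffix of $a with the length-$len prefix of $b
        my $d = distance( substr( $a, -$len ), substr( $b, 0, $len ) );
        $best_len = $len if $d <= $limit;
    }
    printf "'%s' overlaps '%s'\n",
        substr( $a, -$best_len ), substr( $b, 0, $best_len )
        if $best_len;
    # prints: 'baba' overlaps 'babb'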

      then just loop until it crosses a threshold

      then with A,B,X of OP's example, the correct result will be missed

        Ah, True. In a different example, "abcdef" and "abcdef" would iteratively have an edit distance of 1, 2, 3, 4, 2, and then suddenly 0. So you would need to check all lengths.
      «…or just IPC::Run to run a pool of C++ workers…»

      I just wondered how this code would look. Or how you would do it - as you like. Regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

        I was expecting IPC::Run to be the simplest way to orchestrate a pool of non-perl workers that stream text, but it might not actually be any less effort than MCE; this bit of code took longer to write than I expected. It reads STDIN in large chunks and hands each chunk to the worker with the least data in its input queue. It also takes care to divide the input on line boundaries so that it isn't splitting a line between different workers. But it should be much more efficient than the simpler design of reading STDIN one line at a time.

        I tested it with "wc" as a simple way to verify that all input lines were seen by a worker.

        $ time perl parallel.pl wc < LARGE_FILE
        (snip)...
        Waiting for workers to finish
            955  5389459 56444912
            252  6883234 68045448
            284  3803959 62591850
            422  6429932 63866545

        real    0m1.438s
        user    0m3.864s
        sys     0m0.242s

        $ time wc < LARGE_FILE
           1913 22506581 250948755

        real    0m2.332s
        user    0m2.277s
        sys     0m0.052s

        parallel.pl

        use v5.20;
        use strict;
        use warnings;
        use IPC::Run 'start';
        use List::Util qw/ all any reduce /;

        my $workers= 4;
        my $read_size= 512*1024;
        my $queue_full= 256*1024;
        my $command= [@ARGV];

        # Start a pool of workers
        my @jobs= map { { id => $_, in => '', out => '' } } 1..$workers;
        my @run_spec= map { ('&', $command, '<', \$_->{in}, '>', \$_->{out}) } @jobs;
        shift @run_spec;
        my $harness= start(@run_spec);

        my $buf= '';
        my $got= 1;
        while ($got) {
           # The reading of input could just use $buf .= <STDIN> but this is much
           # more efficient, reading 1MB at a time.
           $got= read(STDIN, $buf, $read_size, length $buf);
           if ($got) {
              # look for the final newline within the buffer.  This makes sure we
              # only pass whole lines to the workers
              my $end= rindex $buf, "\n";
              next unless $end >= 0;
              # append the lines to the input queue of the worker with the
              # smallest input queue
              $jobs[0]{in} .= substr($buf, 0, $end+1, '');
              say "Add ".($end+1)." bytes to job $jobs[0]{id}";
           }
           elsif (!defined $got) {
              next if $!{EINTR} || $!{EAGAIN};
              die "read: $!";
           }
           # No more STDIN, so take any leftover string and add it to an input buffer
           elsif (length $buf) {
              $jobs[0]{in} .= $buf;
              say "Add ".length($buf)." bytes to job $jobs[0]{id}";
           }
           # stay in this loop until there is room on one of the input buffers,
           # or after EOF, stay here until all input buffers are flushed
           while ($got? (all { length $_->{in} > $queue_full } @jobs)
                      : (any { length $_->{in} > 0 } @jobs)
           ) {
              # say "I/O: ".join(' ', map sprintf("[%8d:%8d]", length $_->{in}, length $_->{out}), @jobs);
              # send input, receive output
              $harness->pump;
              # process all output from jobs so far
              process_results();
           }
           # sort the worker with the smallest input queue to the front
           @jobs= sort { length $a->{in} <=> length $b->{in} } @jobs;
        }
        say "Waiting for workers to finish";
        # Close the input pipes and wait for workers to exit
        $_->{in}= undef for @jobs;
        $harness->finish;
        # process all the rest
        process_results();

        sub process_results {
           for (@jobs) {
              my $end= rindex $_->{out}, "\n";
              print substr($_->{out}, 0, $end+1, '') if $end >= 0;
           }
        }

        Searching through the MCE repo, I found mce_grep, which ships with MCE. Whether it is installed depends on the OS vendor including it, or on a separate package. The script calls an external binary, grep in this case. One can rename the file, copy it, or make a symbolic link to the script to have it call another external binary.

        It borrows bits from IPC::Run3 for capturing STDOUT/ERR on Unix platforms, but with less overhead, it seems. See here. So this looks to be a full-fledged demonstration of doing just that: IPC::Run-like capability to run a pool of C/C++ workers. It exists :).

        See also: The pure-Perl grep-like implementation is found in the mce-examples repo named egrep.pl.

        Digging more, there lives another full-fledged demonstration calling Inline::C and an external C++ binary. This lives in the mce-sort repo. The Inline::C bits and the corresponding user function partition the data. Sorting the individual partitions in parallel (via an external C++ binary) is done here.

Re: Suffix-prefix matching done right
by LanX (Saint) on Nov 05, 2021 at 11:43 UTC
    Please provide a real test case!

    Randomly chosen base64 strings will almost never overlap. And comparing the efficiency of different solutions is not possible with the fuzzy specs given.

    So efficiency will also depend on the nature of the real data.

    My suggestions:

    • for two same-length substrings $a (tail of $A) and $b (head of $B) you'll get the number of mismatches $m by counting the non-null bytes in ($a ^ $b)
    • those mismatches have to be <= your allowed threshold $X
    • (see my remarks on no feature 'bitwise'; the docs don't seem to be up to date anymore°)

    But you don't need to try all possible lengths, because only a small number of positions are likely:

    • for $X=0 you only need to check those tails in $A which start with the first character $b0 in $B
    • for $X=1 you also need to check those positions which start with the second character $b1 in $B (assuming $b0 was a mismatch)
    • and so on for bigger $X

    This should be far more efficient than your code.
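    A rough sketch of that idea as I understand it, using plain substr plus string xor; the sub name and the "longest acceptable overlap wins" rule are my own choices, not part of the post above:

    use strict;
    use warnings;

    sub overlap {
        my ( $A, $B, $X ) = @_;
        my $maxlen = length($A) < length($B) ? length($A) : length($B);

        # try the longest possible overlap first
        for my $len ( reverse 1 .. $maxlen ) {
            my $start = length($A) - $len;

            # pigeonhole filter: with at most $X mismatches allowed, at least
            # one of the first $X+1 aligned characters must match exactly
            my $plausible = 0;
            for my $j ( 0 .. ( $X < $len - 1 ? $X : $len - 1 ) ) {
                if ( substr( $A, $start + $j, 1 ) eq substr( $B, $j, 1 ) ) {
                    $plausible = 1;
                    last;
                }
            }
            next unless $plausible;

            # count mismatches in the candidate overlap via string xor
            my $mism = ( substr( $A, $start ) ^ substr( $B, 0, $len ) ) =~ tr/\0//c;
            return ( $start, $len, $mism ) if $mism <= $X;
        }
        return;    # no acceptable overlap
    }

    my ( $start, $len, $m ) = overlap( 'abbaba', 'babbaaaa', 1 );
    print "start=$start len=$len mismatches=$m\n" if defined $start;
    # start=2 len=4 mismatches=1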

    updates
    • You've not provided a ranking function which would allow stopping the search among different candidate solutions.

      I.e. when is a longer overlap with one mismatch "better" than a shorter with none?

    • checking base64 for incomplete matches sounds like an XY problem to me

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    °) euphemism for buggy

    edit
    added none
