Hi,

I have a matching problem that i would like your input on. It does not need to be code input but more of, how you would approach it. What I have is a set of paired strings such that sometimes a suffix of a string A is a prefix of a string B. When I say sometimes i mean that this is not always the case. Moreover, the suffix and the prefix need not to be perfect matches, that is, up to X number of mismatches are allowed. Therefore, in case a suffix and prefix do not match, the longest common substring is equal to X (however, such a match is of no value). To illustrate the problem better, consider the following example:

A: abbaba B: babbaaaa A: abbaba B: --babbaaaa
Given the X=1 suffix baba of A is the longest prefix of B. In order to find such matches i constructed a quite naive algorithm to solve it :
#!/usr/bin/perl use strict; srand(4); for (1..10){ &find_a_sufpref_match($_,&generate_random_string(4),&generate_random +_string(4),1); } sub find_a_sufpref_match { my ($id,$a,$b,$msm) = @_; my @a = split("",$a); my @b = split("",$b); my ($start,$length, $m) = (0,0,0); for (my $i = @a; $i>=0; $i--){ my ($x,$mx,$tl,$j) = ($i,0,0,0); while( $j <= @a-$i){ $mx++ if $a[$x] ne $b[$j]; $tl++; if ($mx > $msm){ if ($j < @a-$i){ $tl = 0; } $mx--; last; } $x++; $j++; } ($start,$length,$m) = ($i,$tl-1,$mx) if ($tl -1 > $length) } print "$id:($m)($start,$length)($a,$b)\n"; } sub generate_random_string{ my $length=shift;# the length of the random string to generate my @chars= qw(a b); my $random_string; foreach (1..$length){ $random_string.=$chars[rand @chars]; } return $random_string; }
Which outputs:
1:(1)(0,4)(bbab,bbbb) 2:(1)(0,4)(aaaa,aaba) 3:(1)(1,3)(baaa,abaa) 4:(1)(2,2)(aabb,bbab) 5:(1)(0,4)(abab,abaa) 6:(1)(0,4)(aabb,babb) 7:(1)(3,1)(aaaa,bbaa) 8:(1)(3,1)(baaa,bbba) 9:(1)(0,4)(aaab,baab) 10:(0)(0,4)(aaaa,aaaa)
Maybe the code has a bug in it, but currently i do not see it... The point is to start comparing strings inwards (increasing the size of a compared suffix/prefix match).Clearly this algorithm in O(|A|x|B|) and in case strings are random one could expect the runtime to be closer to linear since the X cutoff is expected to be reached frequently. In reality I am expecting to process 700-900 mil pairs each of size 20 - 2000 characters delivered in batches (files) of 10 mil pairs (that means i have a context of 10 mil pairs that i currently see no value in because the pairs are supposed to be unrelated, aside from the fact that i can process them in parallel).

Does anyone know a smarter (faster) way to solve this matching problem ?

What i also considered was KMP implementation, but i am not sure whether the preprocessing is worth the time and I am not quite sure how to go about the growing pattern length and allowed mismatches.

Anyways, thank you for any or none input provided :)

In reply to Suffix-prefix matching done right by baxy77bax

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.