baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I have a matching problem that i would like your input on. It does not need to be code input but more of, how you would approach it. What I have is a set of paired strings such that sometimes a suffix of a string A is a prefix of a string B. When I say sometimes i mean that this is not always the case. Moreover, the suffix and the prefix need not to be perfect matches, that is, up to X number of mismatches are allowed. Therefore, in case a suffix and prefix do not match, the longest common substring is equal to X (however, such a match is of no value). To illustrate the problem better, consider the following example:
A: abbaba
B: babbaaaa
A: abbaba
B: --babbaaaa
Given the X=1 suffix baba of A is the longest prefix of B. In order to find such matches i constructed a quite naive algorithm to solve it :
#!/usr/bin/perl
use strict;
srand(4);
for (1..10){
&find_a_sufpref_match($_,&generate_random_string(4),&generate_random
+_string(4),1);
}
sub find_a_sufpref_match {
my ($id,$a,$b,$msm) = @_;
my @a = split("",$a);
my @b = split("",$b);
my ($start,$length, $m) = (0,0,0);
for (my $i = @a; $i>=0; $i--){
my ($x,$mx,$tl,$j) = ($i,0,0,0);
while( $j <= @a-$i){
$mx++ if $a[$x] ne $b[$j];
$tl++;
if ($mx > $msm){
if ($j < @a-$i){
$tl = 0;
}
$mx--;
last;
}
$x++;
$j++;
}
($start,$length,$m) = ($i,$tl-1,$mx) if ($tl -1 > $length)
}
print "$id:($m)($start,$length)($a,$b)\n";
}
sub generate_random_string{
my $length=shift;# the length of the random string to generate
my @chars= qw(a b);
my $random_string;
foreach (1..$length){
$random_string.=$chars[rand @chars];
}
return $random_string;
}
Which outputs:
1:(1)(0,4)(bbab,bbbb)
2:(1)(0,4)(aaaa,aaba)
3:(1)(1,3)(baaa,abaa)
4:(1)(2,2)(aabb,bbab)
5:(1)(0,4)(abab,abaa)
6:(1)(0,4)(aabb,babb)
7:(1)(3,1)(aaaa,bbaa)
8:(1)(3,1)(baaa,bbba)
9:(1)(0,4)(aaab,baab)
10:(0)(0,4)(aaaa,aaaa)
Maybe the code has a bug in it, but currently i do not see it... The point is to start comparing strings inwards (increasing the size of a compared suffix/prefix match).Clearly this algorithm in O(|A|x|B|) and in case strings are random one could expect the runtime to be closer to linear since the X cutoff is expected to be reached frequently. In reality I am expecting to process 700-900 mil pairs each of size 20 - 2000 characters delivered in batches (files) of 10 mil pairs (that means i have a context of 10 mil pairs that i currently see no value in because the pairs are supposed to be unrelated, aside from the fact that i can process them in parallel). Does anyone know a smarter (faster) way to solve this matching problem ?
What i also considered was KMP implementation, but i am not sure whether the preprocessing is worth the time and I am not quite sure how to go about the growing pattern length and allowed mismatches.
Anyways, thank you for any or none input provided :)
Re: Suffix-prefix matching done right
by tybalt89 (Monsignor) on Nov 04, 2021 at 21:22 UTC
|
#!/usr/bin/perl
use strict; # https://perlmonks.org/?node_id=11138432
use warnings;
sub randomstring { join '', map +(qw(a b))[rand 2], 1 .. shift }
for (1 .. 10)
{
match( $_, randomstring(4), randomstring(4), 1);
}
sub match
{
my ($n, $x, $y, $m) = @_;
$x =~ /(.+)$(??{
($1 ^ substr $y, 0, length $1) =~ tr!\0!!c <= $m ? '' : '(*FAIL)'
})/ and print "$n:($-[1],@{[ length $1 ]})($x,$y)\n";
}
Outputs:
1:(1,3)(abba,baaa)
2:(3,1)(bbbb,aaba)
3:(1,3)(abba,abab)
4:(1,3)(abbb,bbaa)
5:(2,2)(baba,bbbb)
6:(1,3)(bbab,aabb)
7:(0,4)(aaaa,aaab)
8:(2,2)(baba,baab)
9:(2,2)(aaba,baaa)
10:(1,3)(bbab,bbba)
I hope I understood your problem correctly...
Update: fixed $_ to $n
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Suffix-prefix matching done right
by vr (Curate) on Nov 04, 2021 at 23:18 UTC
|
Without addressing parallelization nor performing any benchmarks, here's naive implementation. Hopefully, fuzzy matching implemented and optimized in C is fast. Maybe, re-writing this with String::Approx instead of re::engine::TRE (subroutine call instead of Perl regexp engine overhead) would be faster. Re-visiting fuzzy string matching was fun :). Solution below was more readable/clear before I tried to get to (perceived, no tests) optimizations like dropping blocks, reversing loop, + remembering that to use a reference to substr result is efficient (no idea if it holds in this case), etc., but here's FWIW.
use strict;
use warnings;
use feature 'say';
use constant ERR => 1;
use re::engine::TRE
max_cost => ERR,
cost_ins => -1, # no insertions
cost_del => -1, # no deletions
;
my $suffix_source = 'abbaba';
my $prefix_source = 'babbaaaa';
my $max_len = 0;
$prefix_source =~ /^${ \substr $suffix_source, -$_ }/
and $max_len = $_
and last
for reverse ERR + 1 .. length $suffix_source;
say substr( $suffix_source, -$max_len ),
' ',
substr( $prefix_source, 0, $max_len )
if $max_len;
__END__
baba babb
Update: crude benchmarks (~200 chars strings, ~10 errors allowed) reveal that, for this task, both modules I mentioned are hugely (some 10s of times) slower than "classic" Perl implementation by tybalt89 :)
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Suffix-prefix matching done right
by LanX (Saint) on Nov 04, 2021 at 18:27 UTC
|
DB<32> p "$A $B"
abbaba babbaaaa
DB<33> no feature 'bitwise'; print 'abbaba' & 'babbaaaa'
``b``a
DB<34> no feature 'bitwise'; print $A & $B
0
DB<35> no feature 'bitwise'; print "$A" & "$B"
``b``a
DB<36> print "$A" & "$B"
0
DB<37>
looks like feature bitwise is now default w/o being explicitly explained in perlop
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Suffix-prefix matching done right
by karlgoethebier (Abbot) on Nov 04, 2021 at 17:45 UTC
|
«…process them in parallel…»
Probably not the baddest idea. Spontaneously mce_loop_f from MCE::Loop comes in my mind. For a trivial example see Re^3: Hash Search is VERY slow. Regards, Karl
«The Crux of the Biscuit is the Apostrophe»
| [reply] [Watch: Dir/Any] [d/l] |
Re: Suffix-prefix matching done right
by LanX (Saint) on Nov 04, 2021 at 18:03 UTC
|
Are you only comparing pairwise, or do you need to compare all combinations of two sets of strings?
I could think of a regex which is efficient, but the overhead to compile it has to be considered.
edit
furthermore, is the alphabet restricted to a and b ?
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
I am only comparing pairwise and the alphabet is a bit larger, Base64. I restricted it to the above binary form just to make it simpler.
| [reply] [Watch: Dir/Any] |
Re: Suffix-prefix matching done right
by NERDVANA (Deacon) on Nov 05, 2021 at 11:18 UTC
|
Find an algorithm you like for comparing edit-distance of two strings, and then just loop until it crosses a threshold or you run out of string.
use Text::WagnerFischer "distance";
use List::Util 'min';
my $a="abbaba";
my $b="ababbba";
my $limit= 3;
my $n= 0;
my $max= min(length $a, length $b);
++$n while $n < $max
&& distance(substr($a,-$n-1), substr($b,0,$n+1)) < $limit;
if ($n) {
printf "'%s' matches '%s' with edit distance %d",
substr($a,-$n), substr($b,0,$n),
distance(substr($a,-$n), substr($b,0,$n))
}
To parallelize for extremely long matches, you could have N threads step $n += N.
Edit: sorry, I skimmed the second half of your question. For batches of 10 million pairs of smallish strings, I think you want to use C or C++. If you have a need to integrate it with Perl code, you can use Inline::C, or just IPC::Run to run a pool of C++ workers which you could then pipe data into and collect the responses. Note that edit distances are a harder problem than your example code can solve, since sometimes having solved it for N characters, adding 1 character could have an optimal edit distance by a completely different set of changes to the string. But, Wagner-Fischer is O(N^2) each time you run it, so maybe you can look for a new variation that calculates "sideways" for each new character rather than starting from scratch each time.
Edit Again: as anonymous points out, a better edit distance might come after a worse edit distance, so actually you would need to iterate over the whole range of substrings. | [reply] [Watch: Dir/Any] [d/l] [select] |
|
| [reply] [Watch: Dir/Any] |
|
Ah, True. In a different example, "abcdef" and "abcdef" would iteratively have an edit distance of 1, 2, 3, 4, 2, and then suddenly 0. So you would need to check all lengths.
| [reply] [Watch: Dir/Any] |
|
| [reply] [Watch: Dir/Any] |
|
$ time perl parallel.pl wc < LARGE_FILE
(snip)...
Waiting for workers to finish
955 5389459 56444912
252 6883234 68045448
284 3803959 62591850
422 6429932 63866545
real 0m1.438s
user 0m3.864s
sys 0m0.242s
$ time wc < LARGE_FILE
1913 22506581 250948755
real 0m2.332s
user 0m2.277s
sys 0m0.052s
parallel.pl
use v5.20;
use strict;
use warnings;
use IPC::Run 'start';
use List::Util qw/ all any reduce /;
my $workers= 4;
my $read_size= 512*1024;
my $queue_full= 256*1024;
my $command= [@ARGV];
# Start a pool of 10 workers
my @jobs= map { { id => $_, in => '', out => '' } } 1..$workers;
my @run_spec= map { ('&', $command, '<', \$_->{in}, '>', \$_->{out}) }
+ @jobs;
shift @run_spec;
my $harness= start(@run_spec);
my $buf= '';
my $got= 1;
while ($got) {
# The reading of input could just use $buf .= <STDIN> but this is
+much more efficient,
# reading 1MB at a time.
$got= read(STDIN, $buf, $read_size, length $buf);
if ($got) {
# look for the final newline within the buffer. This makes su
+re we only pass whole
# lines to the workers
my $end= rindex $buf, '\n';
next unless $end >= 0;
# append the lines to the input queue of the worker with the s
+mallest input queue
$jobs[0]{in} .= substr($buf, 0, $end+1, '');
say "Add ".($end+1)." bytes to job $jobs[0]{id}";
}
elsif (!defined $got) {
next if $!{EINTR} || $!{EAGAIN};
die "read: $!";
}
# No more STDIN, so take any leftover string and add it to an inpu
+t buffer
elsif (length $buf) {
$jobs[0]{in} .= $buf;
say "Add ".length($buf)." bytes to job $jobs[0]{id}";
}
# stay in this loop until there is room on one of the input buffer
+s,
# or after EOF, stay here until all input buffers are flushed
while ($got? (all { length $_->{in} > $queue_full } @jobs)
: (any { length $_->{in} > 0 } @jobs)
) {
# say "I/O: ".join(' ', map sprintf("[%8d:%8d]", length $_->{
+in}, length $_->{out}), @jobs);
# send input, receive output
$harness->pump;
# process all output from jobs so far
process_results();
}
# sort the worker with the smallest input queue to the front
@jobs= sort { length $a->{in} <=> length $b->{in} } @jobs;
}
say "Waiting for workers to finish";
# Close the input pipes and wait for workers to exit
$_->{in}= undef for @jobs;
$harness->finish;
# process all the rest
process_results();
sub process_results {
for (@jobs) {
my $end= rindex $_->{out}, "\n";
print substr($_->{out}, 0, $end+1, '') if $end >= 0;
}
}
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
|
|
|
|
|
|
|
|
Searching through MCE repo, I found mce_grep which ships with MCE. It depends on the OS vender including it or a separate package. The script calls an external binary, grep in this case. One can rename the file, copy, or make a symbolic link to the script itself to have it called another external binary.
It borrows bits from IPC::Run3 for capturing STDOUT/ERR on Unix platforms, but an attempt for lesser overhead it seems. See here. So this looks to be a full-fledged demonstration for doing just that; IPC::Run-like capability to run a pool of C/C++ workers. It exists :).
See also: The pure-Perl grep-like implementation is found in the mce-examples repo named egrep.pl.
Digging more, there lives another full-fledged demonstration calling Inline::C and an external C++ binary. This lives in the mce-sort repo. The Inline::C bits and corresponding user function partitions the data. Sorting the individual partitions in parallel (via an external C++ binary) is done here.
| [reply] [Watch: Dir/Any] |
|
Re: Suffix-prefix matching done right
by LanX (Saint) on Nov 05, 2021 at 11:43 UTC
|
Please provide a real test case!
Randomly chosen base64 strings will almost never overlap. And comparing the efficiency of different solutions is not possible with the fuzzy specs given.
So efficiency will also depend on the nature of the real data.
My suggestions:
- for 2 same length substrings $a (tail $A) and $b (head $B) you'll get the number of missmatches $m by counting the none null bytes in ($a ^ $b)
- those mismatches have to be <= your allowed threshold $X
- (see my remarks on no feature 'bitwise' the docs don't seem to be up to date anymore°)
But you don't need to try all possible length, because only a small number of positions are likely:
- for $X=0 you only need to check those tails in $A which start with the first character $b0 in $B
- for $X=1 you also need to check those positions which start with the second character $b1 in $B (assuming $b0 was a mismatch)
- and so on for bigger $X
This should be far more efficient than your code.
updates
°) euphemism for buggy
edit
added none | [reply] [Watch: Dir/Any] [d/l] [select] |
|
|