in reply to Suffix-prefix matching done right
Find an algorithm you like for computing the edit distance of two strings, then just loop until it crosses a threshold or you run out of string.
use Text::WagnerFischer "distance";
use List::Util 'min';

my $a = "abbaba";
my $b = "ababbba";
my $limit = 3;
my $n = 0;
my $max = min(length $a, length $b);
++$n while $n < $max
    && distance(substr($a, -$n-1), substr($b, 0, $n+1)) < $limit;
if ($n) {
    printf "'%s' matches '%s' with edit distance %d",
        substr($a, -$n), substr($b, 0, $n),
        distance(substr($a, -$n), substr($b, 0, $n));
}
To parallelize for extremely long matches, you could have N threads step $n += N.
Edit: sorry, I skimmed the second half of your question. For batches of 10 million pairs of smallish strings, I think you want C or C++. If you need to integrate it with Perl code, you can use Inline::C, or just IPC::Run to run a pool of C++ workers that you pipe data into and collect the responses from.

Note that edit distance is a harder problem than your example code can solve: having solved it for N characters, adding one more character may reach an optimal edit distance via a completely different set of changes to the string. And Wagner-Fischer is O(N^2) each time you run it, so maybe you can look for a variation that calculates "sideways" for each new character rather than starting from scratch each time.
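For what it's worth, here is a sketch of what that "sideways" idea looks like for the simpler one-sided case, where the first string is fixed and only the second string grows: keep just the last column of the Wagner-Fischer matrix and extend it by one column per new character, which costs O(length of first string) per step instead of recomputing the whole table. (Only illustrative; in the suffix-prefix problem the suffix grows at the front, so this does not apply there directly.)

```perl
use strict;
use warnings;
use List::Util 'min';

# Extend the last Wagner-Fischer column for fixed string $s by one new
# character $ch of the growing second string; $j is the new column index.
sub extend_column {
    my ($s, $col, $ch, $j) = @_;
    my @new = ($j);                            # D[0][$j] = $j
    for my $i (1 .. length $s) {
        my $cost = substr($s, $i-1, 1) eq $ch ? 0 : 1;
        push @new, min($col->[$i]   + 1,       # deletion
                       $new[$i-1]   + 1,       # insertion
                       $col->[$i-1] + $cost);  # substitution or match
    }
    return \@new;
}

my $s   = "kitten";
my $col = [0 .. length $s];                    # column for the empty string
my $j   = 0;
$col = extend_column($s, $col, $_, ++$j) for split //, "sitting";
print "distance: $col->[-1]\n";                # prints "distance: 3"
```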
Edit Again: as anonymous points out, a better edit distance might come after a worse edit distance, so actually you would need to iterate over the whole range of substrings.
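A pure-Perl sketch of that full scan (with a self-contained Levenshtein so it runs without Text::WagnerFischer; the tie-break here, keeping the longest overlap under the limit, is one choice among several):

```perl
use strict;
use warnings;
use List::Util 'min';

# Plain Wagner-Fischer with a rolling row: O(m*n) time, O(n) memory.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = 0 .. length $t;
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i-1, 1) eq substr($t, $j-1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j-1] + 1, $prev[$j-1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Check every overlap length, keeping the longest one whose edit
# distance stays under the limit.
sub best_overlap {
    my ($x, $y, $limit) = @_;
    my ($best_n, $best_d);
    for my $n (1 .. min(length $x, length $y)) {
        my $d = levenshtein(substr($x, -$n), substr($y, 0, $n));
        ($best_n, $best_d) = ($n, $d) if $d < $limit;
    }
    return ($best_n, $best_d);
}

my ($n, $d) = best_overlap("abbaba", "ababbba", 3);
printf "'%s' matches '%s' with edit distance %d\n",
    substr("abbaba", -$n), substr("ababbba", 0, $n), $d;
# prints "'bbaba' matches 'ababb' with edit distance 2"
```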
Re^2: Suffix-prefix matching done right
by vr (Curate) on Nov 05, 2021 at 12:18 UTC
then just loop until it crosses a threshold
then, with the A, B, X of the OP's example, the correct result will be missed
Ah, true. In a different example, comparing "abcdef" against "abcdef" would iteratively give edit distances of 1, 2, 3, 4, 2, and then suddenly 0. So you would need to check all lengths.
Re^2: Suffix-prefix matching done right
by karlgoethebier (Abbot) on Nov 05, 2021 at 18:16 UTC
«…or just IPC::Run to run a pool of C++ workers…»
I just wondered how this code would look. Or how you would do it - as you like. Regards, Karl
«The Crux of the Biscuit is the Apostrophe»
$ time perl parallel.pl wc < LARGE_FILE
(snip)...
Waiting for workers to finish
955 5389459 56444912
252 6883234 68045448
284 3803959 62591850
422 6429932 63866545
real 0m1.438s
user 0m3.864s
sys 0m0.242s
$ time wc < LARGE_FILE
1913 22506581 250948755
real 0m2.332s
user 0m2.277s
sys 0m0.052s
parallel.pl
use v5.20;
use strict;
use warnings;
use IPC::Run 'start';
use List::Util qw/ all any /;

my $workers    = 4;
my $read_size  = 512*1024;
my $queue_full = 256*1024;
my $command    = [@ARGV];

# Start a pool of $workers workers
my @jobs = map { { id => $_, in => '', out => '' } } 1..$workers;
my @run_spec = map { ('&', $command, '<', \$_->{in}, '>', \$_->{out}) } @jobs;
shift @run_spec;
my $harness = start(@run_spec);

my $buf = '';
my $got = 1;
while ($got) {
    # The reading of input could just use $buf .= <STDIN> but this is
    # much more efficient, reading 512KB at a time.
    $got = read(STDIN, $buf, $read_size, length $buf);
    if ($got) {
        # Look for the final newline within the buffer.  This makes sure
        # we only pass whole lines to the workers.
        my $end = rindex $buf, "\n";
        next unless $end >= 0;
        # Append the lines to the input queue of the worker with the
        # smallest input queue.
        $jobs[0]{in} .= substr($buf, 0, $end+1, '');
        say "Add ".($end+1)." bytes to job $jobs[0]{id}";
    }
    elsif (!defined $got) {
        # retry the read on a harmless interruption ('next' would exit
        # the loop here, since $got is undef)
        redo if $!{EINTR} || $!{EAGAIN};
        die "read: $!";
    }
    # No more STDIN, so take any leftover string and add it to an input buffer
    elsif (length $buf) {
        $jobs[0]{in} .= $buf;
        say "Add ".length($buf)." bytes to job $jobs[0]{id}";
    }
    # Stay in this loop until there is room on one of the input buffers,
    # or after EOF, stay here until all input buffers are flushed.
    while ($got? (all { length $_->{in} > $queue_full } @jobs)
               : (any { length $_->{in} > 0 } @jobs)
    ) {
        # say "I/O: ".join(' ', map sprintf("[%8d:%8d]", length $_->{in}, length $_->{out}), @jobs);
        # send input, receive output
        $harness->pump;
        # process all output from jobs so far
        process_results();
    }
    # Sort the worker with the smallest input queue to the front
    @jobs = sort { length $a->{in} <=> length $b->{in} } @jobs;
}
say "Waiting for workers to finish";
# Close the input pipes and wait for workers to exit
$_->{in} = undef for @jobs;
$harness->finish;
# process all the rest
process_results();

sub process_results {
    for (@jobs) {
        my $end = rindex $_->{out}, "\n";
        print substr($_->{out}, 0, $end+1, '') if $end >= 0;
    }
}
Dear friends of the monastery,
Upon searching, there lives an MCE wc.pl example. Please note that timings for wc.pl include the time to load Perl itself plus modules and to spawn/reap workers.
Preparation:
$ curl -O http://www.textfiles.com/etext/FICTION/lesms10.txt
$ curl -O https://raw.githubusercontent.com/marioroy/mce-examples/master/other/wc.pl
$ chmod 755 wc.pl
$ rm -f lesms10_big.txt
$ for i in `seq 1 200`; do cat lesms10.txt >> lesms10_big.txt; done
$ ls -lh lesms10*
-rw-r--r-- 1 pmuser staff 3.1M Nov 13 19:27 lesms10.txt
-rw-r--r-- 1 pmuser staff 623M Nov 13 19:39 lesms10_big.txt
# Read once before testing to be fair for 1st and subsequent wc/wc.pl.
# Sorry, I didn't want to go through the trouble of clearing the OS cache.
$ cat lesms10_big.txt >/dev/null
Results: wc
$ time wc lesms10_big.txt
14152200 113528600 652751200 lesms10_big.txt
real 0m2.132s
user 0m2.076s
sys 0m0.056s
$ time wc < lesms10_big.txt
14152200 113528600 652751200
real 0m2.140s
user 0m2.085s
sys 0m0.055s
Results: wc.pl
$ time ./wc.pl lesms10_big.txt
14152200 113528600 652751200 lesms10_big.txt
real 0m0.536s (3.98x)
user 0m3.570s
sys 0m0.591s
$ time ./wc.pl < lesms10_big.txt
14152200 113528600 652751200
real 0m0.740s (2.89x)
user 0m3.784s
sys 0m1.166s
Results: line count
$ time wc -l lesms10_big.txt
14152200 lesms10_big.txt
real 0m0.351s
user 0m0.296s
sys 0m0.055s
$ time ./wc.pl -l lesms10_big.txt
14152200 lesms10_big.txt
real 0m0.208s
user 0m0.678s
sys 0m0.835s
Results: word count
$ time wc -w lesms10_big.txt
113528600 lesms10_big.txt
real 0m2.125s
user 0m2.071s
sys 0m0.054s
$ time ./wc.pl -w lesms10_big.txt
113528600 lesms10_big.txt
real 0m0.535s
user 0m3.574s
sys 0m0.599s
Results: byte count
$ time wc -c lesms10_big.txt
652751200 lesms10_big.txt
real 0m0.003s
user 0m0.001s
sys 0m0.002s
$ time ./wc.pl -c lesms10_big.txt
652751200 lesms10_big.txt
real 0m0.054s
user 0m0.042s
sys 0m0.010s
Clearing OS cache:
The beliefs of yesterday may no longer be true today. These days, running on one core may be the bottleneck rather than IO, especially with hardware IO reaching gigabytes per second and beyond. Long ago, I discarded old beliefs that no longer matched reality. That mind shift energized me to try, only to be surprised. So kudos to the OP for trying.
Linux: echo 1 | sudo tee /proc/sys/vm/drop_caches
macOS: sudo purge
$ time wc lesms10_big.txt
14152200 113528600 652751200 lesms10_big.txt
real 0m2.435s
user 0m2.330s
sys 0m0.098s
Linux: echo 1 | sudo tee /proc/sys/vm/drop_caches
macOS: sudo purge
$ time ./wc.pl lesms10_big.txt
14152200 113528600 652751200 lesms10_big.txt
real 0m0.662s (3.68x)
user 0m3.598s
sys 0m0.736s
Bless to you all,
|
Searching through the MCE repo, I found mce_grep, which ships with MCE. Whether it is installed depends on the OS vendor including it or a separate package. The script calls an external binary, grep in this case. One can rename the file, copy it, or make a symbolic link to the script to have it call another external binary.
It borrows bits from IPC::Run3 for capturing STDOUT/ERR on Unix platforms, seemingly in an attempt at lower overhead. See here. So this looks to be a full-fledged demonstration of doing just that: IPC::Run-like capability to run a pool of C/C++ workers. It exists :).
See also: the pure-Perl grep-like implementation found in the mce-examples repo, named egrep.pl.
Digging more, there lives another full-fledged demonstration calling Inline::C and an external C++ binary. This lives in the mce-sort repo. The Inline::C bits and the corresponding user function partition the data. Sorting the individual partitions in parallel (via an external C++ binary) is done here.
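The rename/symlink trick described above can be illustrated with a hypothetical stand-in (dispatch.sh and the .sh suffix are my own names, not part of mce_grep): a dispatcher script inspects the name it was invoked as, $0, and execs the external binary of that name.

```shell
# Hypothetical stand-in for the dispatch-on-name trick mce_grep uses.
cat > dispatch.sh <<'EOF'
#!/bin/sh
# Run whatever external binary we were invoked as (strip the .sh suffix).
exec "$(basename "$0" .sh)" "$@"
EOF
chmod +x dispatch.sh
ln -sf dispatch.sh grep.sh         # invoking ./grep.sh now execs the real grep
printf 'foo\nbar\n' | ./grep.sh foo
```

Renaming or symlinking the same file to another name (say sort.sh) would make it drive that binary instead, with no change to the script itself.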