Re^3: Suffix-prefix matching done right

I was expecting IPC::Run to be the simplest way to orchestrate a pool of non-perl workers that stream text, but it might not actually be any less effort than than MCE; this bit of code took longer to write than I expected. It reads STDIN in large chunks, and hands off those chunks to the input queues of the worker with the least in its input queue. It also takes care to divide the input on line boundaries so that it isn't splitting a line between different workers. But, it should be much more efficient than the simpler design of reading STDIN one line at a time.

I tested it with "wc" as a simple way to verify that all input lines were seen by a worker.

$ time perl parallel.pl wc < LARGE_FILE
(snip)...
Waiting for workers to finish
    955 5389459 56444912
    252 6883234 68045448
    284 3803959 62591850
    422 6429932 63866545

real    0m1.438s
user    0m3.864s
sys    0m0.242s

$ time wc < LARGE_FILE
     1913  22506581 250948755

real    0m2.332s
user    0m2.277s
sys    0m0.052s
[download]

parallel.pl

use v5.20;
use strict;
use warnings;
use IPC::Run 'start';
use List::Util qw/ all any reduce /;

my $workers= 4;
my $read_size= 512*1024;
my $queue_full= 256*1024;
my $command= [@ARGV];

# Start a pool of 10 workers
my @jobs= map { { id => $_, in => '', out => '' } } 1..$workers;
my @run_spec= map { ('&', $command, '<', \$_->{in}, '>', \$_->{out}) }
+ @jobs;
shift @run_spec;
my $harness= start(@run_spec);

my $buf= '';
my $got= 1;
while ($got) {
    # The reading of input could just use $buf .= <STDIN> but this is 
+much more efficient,
    # reading 1MB at a time.
    $got= read(STDIN, $buf, $read_size, length $buf);
    if ($got) {
        # look for the final newline within the buffer.  This makes su
+re we only pass whole
        # lines to the workers
        my $end= rindex $buf, '\n';
        next unless $end >= 0;

        # append the lines to the input queue of the worker with the s
+mallest input queue
        $jobs[0]{in} .= substr($buf, 0, $end+1, '');
        say "Add ".($end+1)." bytes to job $jobs[0]{id}";
    }
    elsif (!defined $got) {
        next if $!{EINTR} || $!{EAGAIN};
        die "read: $!";
    }
    # No more STDIN, so take any leftover string and add it to an inpu
+t buffer
    elsif (length $buf) {
        $jobs[0]{in} .= $buf;
        say "Add ".length($buf)." bytes to job $jobs[0]{id}";
    }

    # stay in this loop until there is room on one of the input buffer
+s,
    # or after EOF, stay here until all input buffers are flushed
    while ($got? (all { length $_->{in} > $queue_full } @jobs)
               : (any { length $_->{in} > 0 } @jobs)
    ) {
        # say "I/O: ".join('  ', map sprintf("[%8d:%8d]", length $_->{
+in}, length $_->{out}), @jobs);
        # send input, receive output
        $harness->pump;

        # process all output from jobs so far
        process_results();
    }
    # sort the worker with the smallest input queue to the front
    @jobs= sort { length $a->{in} <=> length $b->{in} } @jobs;
}

say "Waiting for workers to finish";
# Close the input pipes and wait for workers to exit
$_->{in}= undef for @jobs;
$harness->finish;

# process all the rest
process_results();

sub process_results {
    for (@jobs) {
        my $end= rindex $_->{out}, "\n";
        print substr($_->{out}, 0, $end+1, '') if $end >= 0;
    }
}
[download]

Comment on Re^3: Suffix-prefix matching done right Select or Download Code

Replies are listed 'Best First'.
Re^4: Suffix-prefix matching done right by LanX (Saint) on Nov 06, 2021 at 10:12 UTC
I seriously doubt parallelization is the solution here. Reading the data from disk will take far more time than processing the data with a clever algorithm. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply]
Re^5: Suffix-prefix matching done right by NERDVANA (Deacon) on Nov 06, 2021 at 15:47 UTC
A correct edit-distance algorithm is going to be at least O(N^2) and maybe even O(N^3) to check all the different lengths and some strings can be 2000 chars. My code above was able to beat linear "wc". Of course the file was probably entirely cached in ram since I have a lot of that on this system. But I would not be surprised if a SSD could deliver data faster than an N^3 could process it.	[reply]
Re^6: Suffix-prefix matching done right (updated) by LanX (Saint) on Nov 06, 2021 at 16:06 UTC
> at least O(N^2) A disagree, rather at most O(N*log N) for base64 input and a given max distance. My algorithm based on Re: Suffix-prefix matching done right correctly calculated 1million overlaps with up to 6 errors in ~~2 seconds~~ 53 sec , without being particularly tuned for speed. You just have to avoid impossible matches right from the beginning and to bail out with the longest match. Nobody needs all Levenshtein distances for all combinations. Parameters: `my $n_tests = 1_000_000; my $error= 1+ int rand 5; my $A = rand64($n_err + 1 + rand 50); my $B = rand64($n_err + 1 + rand 50); my $prefix = rand64($n_err + 1 + rand 20); > Took: 2.20354223251343 sec` [download] update sorry `> Took: 53.5716695785522` I forgot a bail-out flag. I'll try now strings with 2000 bytes ... update `my $n_tests = 100_000; my $min_str = 2000;# 50; my $max_overlap = 500; #20; > Took: 56.4621770381927` [download] as I said, it's still tuned for correctness not for performance Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^7: Suffix-prefix matching done right (updated) by NERDVANA (Deacon) on Nov 06, 2021 at 18:24 UTC
Re^8: Suffix-prefix matching done right (updated) by LanX (Saint) on Nov 06, 2021 at 18:35 UTC
Some notes below your chosen depth have not been shown here
Re^5: Suffix-prefix matching done right by karlgoethebier (Abbot) on Nov 09, 2021 at 09:27 UTC
What about a correct algorithm plus parallelization? If this problem is embarrassingly parallel. And I think it is. Or may be it is. I always have doubts. Best regards, Karl �The Crux of the Biscuit is the Apostrophe�	[reply]
Re^6: Suffix-prefix matching done right by LanX (Saint) on Nov 09, 2021 at 10:57 UTC
I haven't seen a parallelizable disk drive yet... Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply]
Re^7: Suffix-prefix matching done right by haukex (Archbishop) on Nov 09, 2021 at 11:14 UTC
Re^8: Suffix-prefix matching done right by LanX (Saint) on Nov 09, 2021 at 11:18 UTC
Some notes below your chosen depth have not been shown here
Re^7: Suffix-prefix matching done right by karlgoethebier (Abbot) on Nov 09, 2021 at 17:02 UTC
Re^4: Suffix-prefix matching done right by Anonymous Monk on Nov 14, 2021 at 03:02 UTC
Dear friends of the monastery, Upon searching, there lives a MCE wc.pl example. Please note that timings for wc.pl include the time to load Perl itself plus modules and spawn/reap workers. Preparation: $ curl -O http://www.textfiles.com/etext/FICTION/lesms10.txt $ curl -O https://raw.githubusercontent.com/marioroy/mce-examples/mast +er/other/wc.pl $ chmod 755 wc.pl $ rm -f lesms10_big.txt $ for i in `seq 1 200`; do cat lesms10.txt >> lesms10_big.txt; done $ ls -lh lesms10* -rw-r--r-- 1 pmuser staff 3.1M Nov 13 19:27 lesms10.txt -rw-r--r-- 1 pmuser staff 623M Nov 13 19:39 lesms10_big.txt # Read once before testing to be fair for 1st and subsequent wc/wc.pl. # Sorry, I didn't want to go through the trouble clearing OS cache. $ cat lesms10_big.txt >/dev/null [download] Results: wc `$ time wc lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m2.132s user 0m2.076s sys 0m0.056s $ time wc < lesms10_big.txt 14152200 113528600 652751200 real 0m2.140s user 0m2.085s sys 0m0.055s` [download] Results: wc.pl `$ time ./wc.pl lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m0.536s (3.98x) user 0m3.570s sys 0m0.591s $ time ./wc.pl < lesms10_big.txt 14152200 113528600 652751200 real 0m0.740s (2.89x) user 0m3.784s sys 0m1.166s` [download] Results: line count `$ time wc -l lesms10_big.txt 14152200 lesms10_big.txt real 0m0.351s user 0m0.296s sys 0m0.055s $ time ./wc.pl -l lesms10_big.txt 14152200 lesms10_big.txt real 0m0.208s user 0m0.678s sys 0m0.835s` [download] Results: word count `$ time wc -w lesms10_big.txt 113528600 lesms10_big.txt real 0m2.125s user 0m2.071s sys 0m0.054s $ time ./wc.pl -w lesms10_big.txt 113528600 lesms10_big.txt real 0m0.535s user 0m3.574s sys 0m0.599s` [download] Results: byte count `$ time wc -c lesms10_big.txt 652751200 lesms10_big.txt real 0m0.003s user 0m0.001s sys 0m0.002s $ time ./wc.pl -c lesms10_big.txt 652751200 lesms10_big.txt real 0m0.054s user 0m0.042s sys 0m0.010s` [download] Clearing OS cache: The beliefs of yesterday may no longer be true today. These days, running on one core may be the bottleneck versus IO. Especially for hardware IO reaching gigabytes and beyond. Long ago, I removed old beliefs no longer matching with reality. This mind shift energized me to try, only to be surprised. So kudos to the OP for trying. `Linux: sudo echo 1 >/proc/sys/vm/drop_caches macOS: sudo purge $ time wc lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m2.435s user 0m2.330s sys 0m0.098s Linux: sudo echo 1 >/proc/sys/vm/drop_caches macOS: sudo purge $ time ./wc.pl lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m0.662s (3.68x) user 0m3.598s sys 0m0.736s` [download] Bless to you all,	[reply] [d/l] [select]
Re^5: Suffix-prefix matching done right by Anonymous Monk on Nov 14, 2021 at 04:33 UTC
An attempt on Linux failed clearing the file cache. Thus an update and results from a Linux box. `This works: sudo bash -c "echo 1 >/proc/sys/vm/drop_caches"` [download] Results: `$ sudo bash -c "echo 1 >/proc/sys/vm/drop_caches" $ time wc lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m3.412s user 0m3.307s sys 0m0.097s $ sudo bash -c "echo 1 >/proc/sys/vm/drop_caches" $ time ./wc.pl lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m0.613s (5.57x) user 0m4.168s sys 0m0.432s` [download] Beyond 8 cores: The --max-workers option defaults to auto (8 workers max). Running with --max-workers=N beyond 8 may provide extra performance on a box with more than 8 "real" cores. $ sudo bash -c "echo 1 >/proc/sys/vm/drop_caches" $ time ./wc.pl --max-workers=8 lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m0.613s (5.57x) user 0m4.168s sys 0m0.432s $ sudo bash -c "echo 1 >/proc/sys/vm/drop_caches" $ time ./wc.pl --max-workers=12 lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m0.454s (7.52x) user 0m4.464s sys 0m0.536s $ sudo bash -c "echo 1 >/proc/sys/vm/drop_caches" $ time ./wc.pl --max-workers=16 lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m0.360s (9.48x) user 0m4.344s sys 0m0.625s [download] Compare wc with wc.pl 1 and 2 cores: To complete the demonstration, I also tried wc.pl with --maxworkers=1 and --maxworkers=2. The timings for wc.pl include loading Perl plus modules and spawning/reaping workers. $ sudo bash -c "echo 1 >/proc/sys/vm/drop_caches" $ time wc lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m3.412s user 0m3.307s sys 0m0.097s $ sudo bash -c "echo 1 >/proc/sys/vm/drop_caches" $ time ./wc.pl --max-workers=1 lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m3.553s user 0m3.363s sys 0m0.329s $ sudo bash -c "echo 1 >/proc/sys/vm/drop_caches" $ time ./wc.pl --max-workers=2 lesms10_big.txt 14152200 113528600 652751200 lesms10_big.txt real 0m1.804s user 0m3.339s sys 0m0.357s [download] See also, egrep.pl from mce-examples.	[reply] [d/l] [select]
Re^4: Suffix-prefix matching done right by karlgoethebier (Abbot) on Nov 08, 2021 at 21:45 UTC
Thank you very much and best regards, Karl �The Crux of the Biscuit is the Apostrophe�	[reply]


Don't ask to ask, just ask
	PerlMonks

Re^3: Suffix-prefix matching done right

update

update