in reply to Re^5: Help required in optimizing the output (PERL)
in thread Help required in optimizing the output (PERL)

I don't know how it scales though yet.

Projecting from:

[ 0:53:04.64] c:\test>845979 1000 Generate fake paragraphs Sort Count reversed hash Solving 999/1000 [0 2 3 4 7 8 9 10 11 12 13 1 [4 6 7 8 10 11 12 13 14 15 1 [0 1 3 4 6 7 8 10 11 12 13 1 [0 5 6 7 8 9 10 12 14 15 18 11s [ 1:11:24.79] c:\test>845979 2000 Generate fake paragraphs Sort Count reversed hash Solving 1999/2000 [0 1 4 5 6 7 10 12 13 14 15 16 20 [1 3 5 6 7 9 11 13 14 16 17 18 19 [1 3 4 5 6 7 8 9 10 11 13 14 15 1 [1 2 6 7 8 9 10 12 14 19 20 21 22 50s [ 1:12:25.86] c:\test>845979 4000 Generate fake paragraphs Sort Count reversed hash Solving 3999/4000 [0 1 5 9 10 11 13 14 15 16 18 19 2 [0 1 4 8 9 11 13 14 15 16 18 19 20 [1 4 5 7 8 9 10 11 12 15 16 17 18 [2 3 4 5 6 7 8 9 11 13 17 18 19 21 220s

Each doubling of the number of paras, multiplies the time taken by 4.4. If we double 1000 9 times we get 512,000 (close enough). So, 11 * 4.4^9 gives: 6799340 seconds == 79 days.

Though you will run out of memory long before then on commodity hardware.

Using a testcase generator along the lines of yours, my code gives:

Note Like yours, I don't time the discovery phase. Just the solution phase.

[ 1:31:22.89]C:\test>845818 -W=200 -P=1000 > nul Words: 200 Paras: 1000 Took 0.000000 seconds All the words are covered by 9 paras: [ 1:31:38.04]C:\test>845818 -W=200 -P=1000 > nul Words: 200 Paras: 1000 Took 0.008884907 seconds All the words are covered by 10 paras: [ 1:32:07.65]C:\test>845818 -W=200 -P=2000 > nul Words: 200 Paras: 2000 Took 0.020982027 seconds All the words are covered by 9 paras: [ 1:32:30.77]C:\test>845818 -W=200 -P=4000 > nul Words: 200 Paras: 4000 Took 0.046000004 seconds All the words are covered by 10 paras: [ 1:32:53.62]C:\test>845818 -W=200 -P=8000 > nul Words: 200 Paras: 8000 Took 0.248465061 seconds All the words are covered by 10 paras: [ 1:33:32.98]C:\test>845818 -W=200 -P=16000 > nul Words: 200 Paras: 16000 Took 1.472499847 seconds All the words are covered by 10 paras: [ 1:34:44.75]C:\test>845818 -W=200 -P=32000 > nul Words: 200 Paras: 32000 Took 6.217499971 seconds All the words are covered by 10 paras: [ 1:36:55.87]C:\test>845818 -W=200 -P=64000 > nul Words: 200 Paras: 64000 Took 25.427500010 seconds All the words are covered by 9 paras:

It requires 4 more doublings to get from 64000 to 512,000. And 25.4 / 6.2 = 4.0. So starting with 25.4 * 4.0^4 = 6502.4 = 1.806 hours.

To confirm the projection, I'll leave mine running on a 512,000 paras run while I sleep and post the results tomorrow.

Update: It completed in 27 minutes and used apeek of 500MB::

[ 1:45:29.21]C:\test>845818 -W=200 -P=512000 > nul Words: 200 Paras: 512000 Took 1657.894999981 seconds All the words are covered by 11 paras:

The code incorporating the testcase generator:

#! perl -slw use strict; use Time::HiRes qw[ time ]; sub uniq{ my %x; @x{@_} = (); keys %x } $|++; our $W //= 200; our $P //= 1000; my @idx = ( uniq( map int( rand 60000 ), 1 .. 2*$W ) )[ 0 .. $W-1 ]; print STDERR "Words: ", ~~@idx; my %paras; for my $p ( 1 .. $P ) { local $_ = join ' ', uniq map $idx[ rand @idx ], 0 .. rand( $W/2 ) +; $paras{ $p } //= ''; for my $i ( 0 .. $#idx ) { my $word = $idx[ $i ]; if( m[\b\Q$word]i ) { vec( $paras{ $p }, $i, 1 ) = 1; } } } print STDERR "Paras: ", scalar keys %paras; my $start = time; my @parasByWordCount = map{ unpack 'x[N]A*', $_ } sort map { pack 'NA*', unpack( '%32b*', $paras{ $_ } ), $_ } keys %paras; my @set = shift @parasByWordCount; my $mask = $paras{ $set[ 0 ] }; while( @parasByWordCount ) { my $next = shift @parasByWordCount; if( ( $mask | $paras{ $next } ) ne $mask ) { ## $Next contains new paras, so add to set push @set, $next; $mask |= $paras{ $next } } ## otherwise just discard it } printf STDERR "Took %.9f seconds\n", time() - $start; printf STDERR "All the words are covered by %d paras:\n", scalar @set +; printf "%3d : %s\n", $_, unpack 'b*', $paras{ $_ } for sort{ $a<=>$b } + @set; printf "All : %s\n", unpack 'b*', $mask; __END__ [ 5:17:45.68]C:\test>845818 -W=200 -P=600000 > nul Words: 200 Paras: 600000 Took 71.555999994 seconds All the words are covered by 200 paras:

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.