in reply to Re^4: Help required in optimizing the output (PERL)
in thread Help required in optimizing the output (PERL)

My script takes 25 secs on an old HP notebook or 10 secs on Mac Mini for 600 words, 612 paragraphs. I don't know how it scales though yet.
#!/usr/bin/perl use strict; use warnings; my $maxw = 600; # 200 my $maxp = 612; # 600_000 sub uniq { my %uniq; @uniq{@_} = (); return keys %uniq; } # uniq print {*STDERR} "Generate fake paragraphs\n"; my @para = map [uniq map int(rand $maxw),0..int(rand $maxw)],1..$maxp; print {*STDERR} "Sort\n"; @para = sort {@$a <=> @$b} @para; print {*STDERR} "Count reversed hash\n"; my %words_in_para; for(my $i=0;$i<=$#para;$i++){ foreach my $word(@{ $para[$i] }){ push @{ $words_in_para{$word} },$i; } } print {*STDERR} "Solving\n"; my $start_time = time(); my $i = 0; while ($i <= $#para){ my $para = $para[$i]; unless(grep 1 == @{ $words_in_para{$_} },@$para){ # all words appear somewhere else, delete the paragraph foreach my $word (@$para){ $words_in_para{$word} = [ grep $_ ne $i,@{ $words_in_para{$word} + } ]; } $para[$i] = []; } print {*STDERR} "$i/$maxp\r"; $i++; } my $end_time = time(); print "\n"; print join "\n",map {'['.join(' ',sort {$a<=>$b} @$_).']'} grep @$_,@p +ara; print "\n"; print $end_time - $start_time,"s\n";

Replies are listed 'Best First'.
Re^6: Help required in optimizing the output (600,000 x 200 in 30 minutes)
by BrowserUk (Patriarch) on Jun 23, 2010 at 00:52 UTC
    I don't know how it scales though yet.

    Projecting from:

    [ 0:53:04.64] c:\test>845979 1000 Generate fake paragraphs Sort Count reversed hash Solving 999/1000 [0 2 3 4 7 8 9 10 11 12 13 1 [4 6 7 8 10 11 12 13 14 15 1 [0 1 3 4 6 7 8 10 11 12 13 1 [0 5 6 7 8 9 10 12 14 15 18 11s [ 1:11:24.79] c:\test>845979 2000 Generate fake paragraphs Sort Count reversed hash Solving 1999/2000 [0 1 4 5 6 7 10 12 13 14 15 16 20 [1 3 5 6 7 9 11 13 14 16 17 18 19 [1 3 4 5 6 7 8 9 10 11 13 14 15 1 [1 2 6 7 8 9 10 12 14 19 20 21 22 50s [ 1:12:25.86] c:\test>845979 4000 Generate fake paragraphs Sort Count reversed hash Solving 3999/4000 [0 1 5 9 10 11 13 14 15 16 18 19 2 [0 1 4 8 9 11 13 14 15 16 18 19 20 [1 4 5 7 8 9 10 11 12 15 16 17 18 [2 3 4 5 6 7 8 9 11 13 17 18 19 21 220s

    Each doubling of the number of paras, multiplies the time taken by 4.4. If we double 1000 9 times we get 512,000 (close enough). So, 11 * 4.4^9 gives: 6799340 seconds == 79 days.

    Though you will run out of memory long before then on commodity hardware.

    Using a testcase generator along the lines of yours, my code gives:

    Note Like yours, I don't time the discovery phase. Just the solution phase.

    [ 1:31:22.89]C:\test>845818 -W=200 -P=1000 > nul Words: 200 Paras: 1000 Took 0.000000 seconds All the words are covered by 9 paras: [ 1:31:38.04]C:\test>845818 -W=200 -P=1000 > nul Words: 200 Paras: 1000 Took 0.008884907 seconds All the words are covered by 10 paras: [ 1:32:07.65]C:\test>845818 -W=200 -P=2000 > nul Words: 200 Paras: 2000 Took 0.020982027 seconds All the words are covered by 9 paras: [ 1:32:30.77]C:\test>845818 -W=200 -P=4000 > nul Words: 200 Paras: 4000 Took 0.046000004 seconds All the words are covered by 10 paras: [ 1:32:53.62]C:\test>845818 -W=200 -P=8000 > nul Words: 200 Paras: 8000 Took 0.248465061 seconds All the words are covered by 10 paras: [ 1:33:32.98]C:\test>845818 -W=200 -P=16000 > nul Words: 200 Paras: 16000 Took 1.472499847 seconds All the words are covered by 10 paras: [ 1:34:44.75]C:\test>845818 -W=200 -P=32000 > nul Words: 200 Paras: 32000 Took 6.217499971 seconds All the words are covered by 10 paras: [ 1:36:55.87]C:\test>845818 -W=200 -P=64000 > nul Words: 200 Paras: 64000 Took 25.427500010 seconds All the words are covered by 9 paras:

    It requires 4 more doublings to get from 64000 to 512,000. And 25.4 / 6.2 = 4.0. So starting with 25.4 * 4.0^4 = 6502.4 = 1.806 hours.

    To confirm the projection, I'll leave mine running on a 512,000 paras run while I sleep and post the results tomorrow.

    Update: It completed in 27 minutes and used apeek of 500MB::

    [ 1:45:29.21]C:\test>845818 -W=200 -P=512000 > nul Words: 200 Paras: 512000 Took 1657.894999981 seconds All the words are covered by 11 paras:

    The code incorporating the testcase generator:

    #! perl -slw use strict; use Time::HiRes qw[ time ]; sub uniq{ my %x; @x{@_} = (); keys %x } $|++; our $W //= 200; our $P //= 1000; my @idx = ( uniq( map int( rand 60000 ), 1 .. 2*$W ) )[ 0 .. $W-1 ]; print STDERR "Words: ", ~~@idx; my %paras; for my $p ( 1 .. $P ) { local $_ = join ' ', uniq map $idx[ rand @idx ], 0 .. rand( $W/2 ) +; $paras{ $p } //= ''; for my $i ( 0 .. $#idx ) { my $word = $idx[ $i ]; if( m[\b\Q$word]i ) { vec( $paras{ $p }, $i, 1 ) = 1; } } } print STDERR "Paras: ", scalar keys %paras; my $start = time; my @parasByWordCount = map{ unpack 'x[N]A*', $_ } sort map { pack 'NA*', unpack( '%32b*', $paras{ $_ } ), $_ } keys %paras; my @set = shift @parasByWordCount; my $mask = $paras{ $set[ 0 ] }; while( @parasByWordCount ) { my $next = shift @parasByWordCount; if( ( $mask | $paras{ $next } ) ne $mask ) { ## $Next contains new paras, so add to set push @set, $next; $mask |= $paras{ $next } } ## otherwise just discard it } printf STDERR "Took %.9f seconds\n", time() - $start; printf STDERR "All the words are covered by %d paras:\n", scalar @set +; printf "%3d : %s\n", $_, unpack 'b*', $paras{ $_ } for sort{ $a<=>$b } + @set; printf "All : %s\n", unpack 'b*', $mask; __END__ [ 5:17:45.68]C:\test>845818 -W=200 -P=600000 > nul Words: 200 Paras: 600000 Took 71.555999994 seconds All the words are covered by 200 paras:

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.