Re^5: Help required in optimizing the output (PERL)

My script takes 25 secs on an old HP notebook or 10 secs on Mac Mini for 600 words, 612 paragraphs. I don't know how it scales though yet.

#!/usr/bin/perl

use strict;
use warnings;

my $maxw = 600; # 200
my $maxp = 612; # 600_000

sub uniq {
  my %uniq;
  @uniq{@_} = ();
  return keys %uniq;
} # uniq

print {*STDERR} "Generate fake paragraphs\n";
my @para = map [uniq map int(rand $maxw),0..int(rand $maxw)],1..$maxp;

print {*STDERR} "Sort\n";
@para = sort {@$a <=> @$b} @para;


print {*STDERR} "Count reversed hash\n";
my %words_in_para;
for(my $i=0;$i<=$#para;$i++){
  foreach my $word(@{ $para[$i] }){
    push @{ $words_in_para{$word} },$i;
  }
}

print {*STDERR} "Solving\n";
my $start_time = time();
my $i = 0;
while ($i <= $#para){
  my $para = $para[$i];

  unless(grep 1 == @{ $words_in_para{$_} },@$para){
    # all words appear somewhere else, delete the paragraph
    foreach my $word (@$para){
      $words_in_para{$word} = [ grep $_ ne $i,@{ $words_in_para{$word}
+ } ];
    }
    $para[$i] = [];
  }
  print {*STDERR} "$i/$maxp\r";
  $i++;
}

my $end_time = time();

print "\n";
print join "\n",map {'['.join(' ',sort {$a<=>$b} @$_).']'} grep @$_,@p
+ara;
print "\n";
print $end_time - $start_time,"s\n";
[download]

Comment on Re^5: Help required in optimizing the output (PERL) Download Code

Replies are listed 'Best First'.
Re^6: Help required in optimizing the output (600,000 x 200 in 30 minutes) by BrowserUk (Patriarch) on Jun 23, 2010 at 00:52 UTC
I don't know how it scales though yet. Projecting from: [ 0:53:04.64] c:\test>845979 1000 Generate fake paragraphs Sort Count reversed hash Solving 999/1000 [0 2 3 4 7 8 9 10 11 12 13 1 [4 6 7 8 10 11 12 13 14 15 1 [0 1 3 4 6 7 8 10 11 12 13 1 [0 5 6 7 8 9 10 12 14 15 18 11s [ 1:11:24.79] c:\test>845979 2000 Generate fake paragraphs Sort Count reversed hash Solving 1999/2000 [0 1 4 5 6 7 10 12 13 14 15 16 20 [1 3 5 6 7 9 11 13 14 16 17 18 19 [1 3 4 5 6 7 8 9 10 11 13 14 15 1 [1 2 6 7 8 9 10 12 14 19 20 21 22 50s [ 1:12:25.86] c:\test>845979 4000 Generate fake paragraphs Sort Count reversed hash Solving 3999/4000 [0 1 5 9 10 11 13 14 15 16 18 19 2 [0 1 4 8 9 11 13 14 15 16 18 19 20 [1 4 5 7 8 9 10 11 12 15 16 17 18 [2 3 4 5 6 7 8 9 11 13 17 18 19 21 220s [download] Each doubling of the number of paras, multiplies the time taken by 4.4. If we double 1000 9 times we get 512,000 (close enough). So, 11 * 4.4^9 gives: 6799340 seconds == 79 days. Though you will run out of memory long before then on commodity hardware. Using a testcase generator along the lines of yours, my code gives: Note Like yours, I don't time the discovery phase. Just the solution phase. [ 1:31:22.89]C:\test>845818 -W=200 -P=1000 > nul Words: 200 Paras: 1000 Took 0.000000 seconds All the words are covered by 9 paras: [ 1:31:38.04]C:\test>845818 -W=200 -P=1000 > nul Words: 200 Paras: 1000 Took 0.008884907 seconds All the words are covered by 10 paras: [ 1:32:07.65]C:\test>845818 -W=200 -P=2000 > nul Words: 200 Paras: 2000 Took 0.020982027 seconds All the words are covered by 9 paras: [ 1:32:30.77]C:\test>845818 -W=200 -P=4000 > nul Words: 200 Paras: 4000 Took 0.046000004 seconds All the words are covered by 10 paras: [ 1:32:53.62]C:\test>845818 -W=200 -P=8000 > nul Words: 200 Paras: 8000 Took 0.248465061 seconds All the words are covered by 10 paras: [ 1:33:32.98]C:\test>845818 -W=200 -P=16000 > nul Words: 200 Paras: 16000 Took 1.472499847 seconds All the words are covered by 10 paras: [ 1:34:44.75]C:\test>845818 -W=200 -P=32000 > nul Words: 200 Paras: 32000 Took 6.217499971 seconds All the words are covered by 10 paras: [ 1:36:55.87]C:\test>845818 -W=200 -P=64000 > nul Words: 200 Paras: 64000 Took 25.427500010 seconds All the words are covered by 9 paras: [download] It requires 4 more doublings to get from 64000 to 512,000. And 25.4 / 6.2 = 4.0. So starting with 25.4 * 4.0^4 = 6502.4 = 1.806 hours. To confirm the projection, I'll leave mine running on a 512,000 paras run while I sleep and post the results tomorrow. Update: It completed in 27 minutes and used apeek of 500MB:: `[ 1:45:29.21]C:\test>845818 -W=200 -P=512000 > nul Words: 200 Paras: 512000 Took 1657.894999981 seconds All the words are covered by 11 paras:` [download] The code incorporating the testcase generator: #! perl -slw use strict; use Time::HiRes qw[ time ]; sub uniq{ my %x; @x{@_} = (); keys %x } $\|++; our $W //= 200; our $P //= 1000; my @idx = ( uniq( map int( rand 60000 ), 1 .. 2$W ) )[ 0 .. $W-1 ]; print STDERR "Words: ", ~~@idx; my %paras; for my $p ( 1 .. $P ) { local $_ = join ' ', uniq map $idx[ rand @idx ], 0 .. rand( $W/2 ) +; $paras{ $p } //= ''; for my $i ( 0 .. $#idx ) { my $word = $idx[ $i ]; if( m[\b\Q$word]i ) { vec( $paras{ $p }, $i, 1 ) = 1; } } } print STDERR "Paras: ", scalar keys %paras; my $start = time; my @parasByWordCount = map{ unpack 'x[N]A', $_ } sort map { pack 'NA', unpack( '%32b', $paras{ $_ } ), $_ } keys %paras; my @set = shift @parasByWordCount; my $mask = $paras{ $set[ 0 ] }; while( @parasByWordCount ) { my $next = shift @parasByWordCount; if( ( $mask \| $paras{ $next } ) ne $mask ) { ## $Next contains new paras, so add to set push @set, $next; $mask \|= $paras{ $next } } ## otherwise just discard it } printf STDERR "Took %.9f seconds\n", time() - $start; printf STDERR "All the words are covered by %d paras:\n", scalar @set +; printf "%3d : %s\n", $_, unpack 'b', $paras{ $_ } for sort{ $a<=>$b } + @set; printf "All : %s\n", unpack 'b', $mask; __END__ [ 5:17:45.68]C:\test>845818 -W=200 -P=600000 > nul Words: 200 Paras: 600000 Took 71.555999994 seconds All the words are covered by 200 paras: [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l] [select]

Replies are listed 'Best First'.

Re^6: Help required in optimizing the output (600,000 x 200 in 30 minutes)
by BrowserUk (Patriarch) on Jun 23, 2010 at 00:52 UTC

I don't know how it scales though yet.

Projecting from:

[ 0:53:04.64] c:\test>845979 1000
Generate fake paragraphs
Sort
Count reversed hash
Solving
999/1000
[0 2 3 4 7 8 9 10 11 12 13 1
[4 6 7 8 10 11 12 13 14 15 1
[0 1 3 4 6 7 8 10 11 12 13 1
[0 5 6 7 8 9 10 12 14 15 18
11s

[ 1:11:24.79] c:\test>845979 2000
Generate fake paragraphs
Sort
Count reversed hash
Solving
1999/2000
[0 1 4 5 6 7 10 12 13 14 15 16 20
[1 3 5 6 7 9 11 13 14 16 17 18 19
[1 3 4 5 6 7 8 9 10 11 13 14 15 1
[1 2 6 7 8 9 10 12 14 19 20 21 22
50s

[ 1:12:25.86] c:\test>845979 4000
Generate fake paragraphs
Sort
Count reversed hash
Solving
3999/4000
[0 1 5 9 10 11 13 14 15 16 18 19 2
[0 1 4 8 9 11 13 14 15 16 18 19 20
[1 4 5 7 8 9 10 11 12 15 16 17 18
[2 3 4 5 6 7 8 9 11 13 17 18 19 21
220s
[download]

Each doubling of the number of paras, multiplies the time taken by 4.4. If we double 1000 9 times we get 512,000 (close enough). So, 11 * 4.4^9 gives: 6799340 seconds == 79 days.

Though you will run out of memory long before then on commodity hardware.

Using a testcase generator along the lines of yours, my code gives:

Note Like yours, I don't time the discovery phase. Just the solution phase.

[ 1:31:22.89]C:\test>845818 -W=200 -P=1000 > nul
Words: 200
Paras: 1000
Took 0.000000 seconds
All the words are covered by 9 paras:

[ 1:31:38.04]C:\test>845818 -W=200 -P=1000 > nul
Words: 200
Paras: 1000
Took 0.008884907 seconds
All the words are covered by 10 paras:

[ 1:32:07.65]C:\test>845818 -W=200 -P=2000 > nul
Words: 200
Paras: 2000
Took 0.020982027 seconds
All the words are covered by 9 paras:

[ 1:32:30.77]C:\test>845818 -W=200 -P=4000 > nul
Words: 200
Paras: 4000
Took 0.046000004 seconds
All the words are covered by 10 paras:

[ 1:32:53.62]C:\test>845818 -W=200 -P=8000 > nul
Words: 200
Paras: 8000
Took 0.248465061 seconds
All the words are covered by 10 paras:

[ 1:33:32.98]C:\test>845818 -W=200 -P=16000 > nul
Words: 200
Paras: 16000
Took 1.472499847 seconds
All the words are covered by 10 paras:

[ 1:34:44.75]C:\test>845818 -W=200 -P=32000 > nul
Words: 200
Paras: 32000
Took 6.217499971 seconds
All the words are covered by 10 paras:

[ 1:36:55.87]C:\test>845818 -W=200 -P=64000 > nul
Words: 200
Paras: 64000
Took 25.427500010 seconds
All the words are covered by 9 paras:
[download]

It requires 4 more doublings to get from 64000 to 512,000. And 25.4 / 6.2 = 4.0. So starting with 25.4 * 4.0^4 = 6502.4 = 1.806 hours.

To confirm the projection, I'll leave mine running on a 512,000 paras run while I sleep and post the results tomorrow.

Update: It completed in 27 minutes and used apeek of 500MB::

[ 1:45:29.21]C:\test>845818 -W=200 -P=512000 > nul
Words: 200
Paras: 512000
Took 1657.894999981 seconds
All the words are covered by 11 paras:
[download]

The code incorporating the testcase generator:

#! perl -slw
use strict;
use Time::HiRes qw[ time ];

sub uniq{ my %x; @x{@_} = (); keys %x }

$|++;
our $W //= 200;
our $P //= 1000;

my @idx = ( uniq( map int( rand 60000 ), 1 .. 2*$W ) )[ 0 .. $W-1 ];
print STDERR "Words: ", ~~@idx;

my %paras;
for my $p ( 1 .. $P ) {
    local $_ = join ' ', uniq map $idx[ rand @idx ], 0 .. rand( $W/2 )
+;

    $paras{ $p } //= '';
    for my $i ( 0 .. $#idx ) {
        my $word = $idx[ $i ];
        if( m[\b\Q$word]i ) {
            vec( $paras{ $p }, $i, 1 ) = 1;
        }
    }
}
print STDERR "Paras: ", scalar keys %paras;

my $start = time;
my @parasByWordCount = map{
    unpack 'x[N]A*', $_
} sort map {
    pack 'NA*', unpack( '%32b*', $paras{ $_ } ), $_
} keys %paras;

my @set = shift @parasByWordCount;
my $mask = $paras{ $set[ 0 ] };

while( @parasByWordCount ) {
    my $next = shift @parasByWordCount;
    if( ( $mask | $paras{ $next } ) ne $mask ) {
        ## $Next contains new paras, so add to set
        push @set, $next;
        $mask |= $paras{ $next }
    }
    ## otherwise just discard it
}
printf STDERR "Took %.9f seconds\n", time() - $start;

printf  STDERR "All the words are covered by %d paras:\n", scalar @set
+;
printf "%3d : %s\n", $_, unpack 'b*', $paras{ $_ } for sort{ $a<=>$b }
+ @set;
printf "All : %s\n", unpack 'b*', $mask;

__END__
[ 5:17:45.68]C:\test>845818 -W=200 -P=600000 > nul
Words: 200
Paras: 600000
Took 71.555999994 seconds
All the words are covered by 200 paras:
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

[reply]
[d/l]
[select]