in reply to Re^2: Help required in optimizing the output (PERL)
in thread Help required in optimizing the output (PERL)

The words file has one word per line. In the other file, which contains the paragraphs, each paragraph can be treated as a single line; every paragraph starts with a > symbol, so an occurrence of > marks the start of a new para. I already have two hashes storing this information: one hash mapping each paragraph to the words it contains, and another mapping each word to the paragraphs that contain it. The only remaining challenge is to find the minimum set of paras that includes all the keywords. Help is greatly appreciated.
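For concreteness, here is a minimal sketch of the discovery pass described above. The filenames words.txt and paras.txt are placeholders, and the matching rule (case-insensitive whole-word) is an assumption:

#! perl -slw
use strict;

## Placeholder filenames: one keyword per line; one '>'-prefixed para per line.
open my $wfh, '<', 'words.txt' or die $!;
chomp( my @words = <$wfh> );

open my $pfh, '<', 'paras.txt' or die $!;

my( %wordsInPara, %parasWithWord );
my $paraNo = 0;
while( my $para = <$pfh> ) {
    next unless $para =~ s/^>//;                    ## '>' starts a new para
    ++$paraNo;
    for my $word ( @words ) {
        next unless $para =~ m/\b\Q$word\E\b/i;     ## assumed matching rule
        push @{ $wordsInPara{ $paraNo } }, $word;   ## para => its keywords
        push @{ $parasWithWord{ $word } }, $paraNo; ## keyword => its paras
    }
}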

Re^4: Help required in optimizing the output (PERL)
by BrowserUk (Patriarch) on Jun 22, 2010 at 15:49 UTC
    one hash mapping each paragraph to the words it contains, and another mapping each word to the paragraphs that contain it.

    As worded, both those hashes contain the same information? Please clarify.

    Also, how many paras (not lines, or GB)? And how many (key)words?

    Finding the optimal solution is NP-hard (it is the classic minimum set-cover problem), so brute force is essentially the only way to guarantee the minimum. But there are some efficient mechanisms for testing candidate subsets, and others that can help prune the search space (see the sketch below); we need to know the scale of the parameters first.
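    By way of illustration only (my sketch, not code from this thread), here is the shape of a pruned exact search on toy data: it abandons any branch that can no longer produce a smaller cover than the best one found so far.

    #! perl -slw
    use strict;

    ## Toy data: para ID => the keywords it contains.
    my %wordsInPara = (
        1 => [ qw( apple banana ) ],
        2 => [ qw( banana cherry date ) ],
        3 => [ qw( apple cherry ) ],
        4 => [ qw( date ) ],
    );
    my %need; $need{ $_ } = 1 for map @$_, values %wordsInPara;
    my $nWords = keys %need;

    my @best;       ## smallest covering set found so far

    sub search {
        my( $chosen, $covered, @rest ) = @_;
        if( keys %$covered == $nWords ) {               ## full cover found
            @best = @$chosen if !@best or @$chosen < @best;
            return;
        }
        return if @best and @$chosen + 1 >= @best;      ## prune: cannot beat best
        return unless @rest;                            ## dead end
        my $p = shift @rest;
        ## Branch 1: include para $p.
        my %cov2 = ( %$covered, map { $_ => 1 } @{ $wordsInPara{ $p } } );
        search( [ @$chosen, $p ], \%cov2, @rest );
        ## Branch 2: exclude para $p.
        search( $chosen, $covered, @rest );
    }

    search( [], {}, sort keys %wordsInPara );
    print "Minimal cover: @best";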


Re^4: Help required in optimizing the output (PERL)
by BrowserUk (Patriarch) on Jun 22, 2010 at 20:08 UTC

    This uses bitstrings to represent the keywords contained within each para. This means:

    • Storing the information about which of the 600,000 paras contain which of the 200 words requires 600,000 x (200/8) bytes = 15 MB of raw bitstrings; with hashing overheads, roughly 150 MB. Doable on most systems.
    • Performing union, intersection, symmetric difference and other set operations is very fast (see the sketch after this list).

      My testcase looks for 599 words in 612 paragraphs and whittles it down to 34 paras in 16 seconds.

      Naively projecting that, it should deal with 200 words in 600,000 paras in 1 1/2 hours?
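    To make the representation concrete, a minimal standalone illustration with made-up data (the real script is further down):

    #! perl -slw
    use strict;

    my @words = qw( alpha beta gamma delta );   ## made-up keyword list

    ## One bitstring per para: bit $i is set iff the para contains $words[$i].
    my %paras = ( 1 => '', 2 => '' );
    vec( $paras{ 1 }, 0, 1 ) = 1;   ## para 1 contains 'alpha'
    vec( $paras{ 1 }, 2, 1 ) = 1;   ## ... and 'gamma'
    vec( $paras{ 2 }, 1, 1 ) = 1;   ## para 2 contains 'beta'
    vec( $paras{ 2 }, 3, 1 ) = 1;   ## ... and 'delta'

    my $union = $paras{ 1 } | $paras{ 2 };      ## set union is bitwise OR
    print 'Union:        ', unpack 'b4', $union;
    print 'Intersection: ', unpack 'b4', $paras{ 1 } & $paras{ 2 };
    print 'Symm. diff.:  ', unpack 'b4', $paras{ 1 } ^ $paras{ 2 };

    ## A candidate set covers every word when its OR-ed mask has all bits set.
    my $full = ''; vec( $full, $_, 1 ) = 1 for 0 .. $#words;
    print 'Covers all words' if $union eq $full;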

    It may make sense to perform a second filtration pass to exclude paras whose contribution was entirely covered by later inclusions. This might even be made iterative until no further exclusions can be made (see the sketch below).

    For my testcase this refinement made no difference, so I removed the code.
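    In outline, such a pass might look like the following (my sketch on toy data, not the code I removed). It repeatedly drops any chosen para whose words are already covered by the rest of the set:

    #! perl -slw
    use strict;

    ## Toy data: 3 keywords (bits 0..2); any two of these paras cover all three.
    my %paras = map { $_ => '' } 1 .. 3;
    vec( $paras{ 1 }, 0, 1 ) = 1;  vec( $paras{ 1 }, 1, 1 ) = 1;  ## para 1: words 0,1
    vec( $paras{ 2 }, 1, 1 ) = 1;  vec( $paras{ 2 }, 2, 1 ) = 1;  ## para 2: words 1,2
    vec( $paras{ 3 }, 0, 1 ) = 1;  vec( $paras{ 3 }, 2, 1 ) = 1;  ## para 3: words 0,2
    my @set = ( 1, 2, 3 );

    my $changed = 1;
    while( $changed ) {                     ## iterate until no more exclusions
        $changed = 0;
        for my $i ( reverse 0 .. $#set ) {
            my $rest = '';
            $rest |= $paras{ $set[ $_ ] } for grep $_ != $i, 0 .. $#set;
            ## Drop para $i if the remaining paras already cover its words.
            if( ( $rest | $paras{ $set[ $i ] } ) eq $rest ) {
                splice @set, $i, 1;
                $changed = 1;
            }
        }
    }
    print "Pruned set: @set";       ## here: one redundant para is dropped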


      My script takes 25 secs on an old HP notebook, or 10 secs on a Mac Mini, for 600 words and 612 paragraphs. I don't know how it scales yet, though.
        I don't know how it scales yet, though.

        Projecting from:

        [ 0:53:04.64] c:\test>845979 1000
        Generate fake paragraphs
        Sort
        Count reversed hash
        Solving 999/1000
        [0 2 3 4 7 8 9 10 11 12 13 1
        [4 6 7 8 10 11 12 13 14 15 1
        [0 1 3 4 6 7 8 10 11 12 13 1
        [0 5 6 7 8 9 10 12 14 15 18
        11s

        [ 1:11:24.79] c:\test>845979 2000
        Generate fake paragraphs
        Sort
        Count reversed hash
        Solving 1999/2000
        [0 1 4 5 6 7 10 12 13 14 15 16 20
        [1 3 5 6 7 9 11 13 14 16 17 18 19
        [1 3 4 5 6 7 8 9 10 11 13 14 15 1
        [1 2 6 7 8 9 10 12 14 19 20 21 22
        50s

        [ 1:12:25.86] c:\test>845979 4000
        Generate fake paragraphs
        Sort
        Count reversed hash
        Solving 3999/4000
        [0 1 5 9 10 11 13 14 15 16 18 19 2
        [0 1 4 8 9 11 13 14 15 16 18 19 20
        [1 4 5 7 8 9 10 11 12 15 16 17 18
        [2 3 4 5 6 7 8 9 11 13 17 18 19 21
        220s

        Each doubling of the number of paras multiplies the time taken by about 4.4. If we double 1,000 nine times we get 512,000 (close enough). So 11 * 4.4^9 gives 6,799,340 seconds == 79 days.

        Though you will run out of memory long before then on commodity hardware.

        Using a testcase generator along the lines of yours, my code gives:

        Note: like yours, I don't time the discovery phase, just the solution phase.

        [ 1:31:22.89]C:\test>845818 -W=200 -P=1000 > nul
        Words: 200
        Paras: 1000
        Took 0.000000 seconds
        All the words are covered by 9 paras:

        [ 1:31:38.04]C:\test>845818 -W=200 -P=1000 > nul
        Words: 200
        Paras: 1000
        Took 0.008884907 seconds
        All the words are covered by 10 paras:

        [ 1:32:07.65]C:\test>845818 -W=200 -P=2000 > nul
        Words: 200
        Paras: 2000
        Took 0.020982027 seconds
        All the words are covered by 9 paras:

        [ 1:32:30.77]C:\test>845818 -W=200 -P=4000 > nul
        Words: 200
        Paras: 4000
        Took 0.046000004 seconds
        All the words are covered by 10 paras:

        [ 1:32:53.62]C:\test>845818 -W=200 -P=8000 > nul
        Words: 200
        Paras: 8000
        Took 0.248465061 seconds
        All the words are covered by 10 paras:

        [ 1:33:32.98]C:\test>845818 -W=200 -P=16000 > nul
        Words: 200
        Paras: 16000
        Took 1.472499847 seconds
        All the words are covered by 10 paras:

        [ 1:34:44.75]C:\test>845818 -W=200 -P=32000 > nul
        Words: 200
        Paras: 32000
        Took 6.217499971 seconds
        All the words are covered by 10 paras:

        [ 1:36:55.87]C:\test>845818 -W=200 -P=64000 > nul
        Words: 200
        Paras: 64000
        Took 25.427500010 seconds
        All the words are covered by 9 paras:

        It requires 3 more doublings to get from 64,000 to 512,000 (64,000 * 2^3 = 512,000). And 25.4 / 6.2 ~= 4.0. So, starting with 25.4 seconds: 25.4 * 4.0^3 = 1625.6 seconds ~= 27 minutes.

        To confirm the projection, I'll leave mine running on a 512,000 paras run while I sleep and post the results tomorrow.

        Update: It completed in 27 minutes and used a peak of 500 MB:

        [ 1:45:29.21]C:\test>845818 -W=200 -P=512000 > nul
        Words: 200
        Paras: 512000
        Took 1657.894999981 seconds
        All the words are covered by 11 paras:

        The code incorporating the testcase generator:

        #! perl -slw
        use strict;
        use Time::HiRes qw[ time ];

        sub uniq{ my %x; @x{@_} = (); keys %x }

        $|++;

        our $W //= 200;     ## number of keywords
        our $P //= 1000;    ## number of paragraphs

        ## Pick $W unique random "words" (integers) for the fake testcase.
        my @idx = ( uniq( map int( rand 60000 ), 1 .. 2*$W ) )[ 0 .. $W-1 ];
        print STDERR "Words: ", ~~@idx;

        ## Discovery phase (not timed): one bitstring per para,
        ## bit $i set iff the para matches word $idx[$i].
        my %paras;
        for my $p ( 1 .. $P ) {
            local $_ = join ' ', uniq map $idx[ rand @idx ], 0 .. rand( $W/2 );
            $paras{ $p } //= '';
            for my $i ( 0 .. $#idx ) {
                my $word = $idx[ $i ];
                if( m[\b\Q$word]i ) {
                    vec( $paras{ $p }, $i, 1 ) = 1;
                }
            }
        }
        print STDERR "Paras: ", scalar keys %paras;

        my $start = time;

        ## Order the para IDs by their keyword counts (packed-string sort).
        my @parasByWordCount = map{
            unpack 'x[N]A*', $_
        } sort map {
            pack 'NA*', unpack( '%32b*', $paras{ $_ } ), $_
        } keys %paras;

        ## Greedy pass: keep any para that contributes at least one new word.
        my @set  = shift @parasByWordCount;
        my $mask = $paras{ $set[ 0 ] };
        while( @parasByWordCount ) {
            my $next = shift @parasByWordCount;
            if( ( $mask | $paras{ $next } ) ne $mask ) {
                ## $next contains new words, so add it to the set
                push @set, $next;
                $mask |= $paras{ $next };
            } ## otherwise just discard it
        }

        printf STDERR "Took %.9f seconds\n", time() - $start;
        printf STDERR "All the words are covered by %d paras:\n", scalar @set;
        printf "%3d : %s\n", $_, unpack 'b*', $paras{ $_ } for sort{ $a<=>$b } @set;
        printf "All : %s\n", unpack 'b*', $mask;
        __END__
        [ 5:17:45.68]C:\test>845818 -W=200 -P=600000 > nul
        Words: 200
        Paras: 600000
        Took 71.555999994 seconds
        All the words are covered by 200 paras:
