in reply to Re^3: Help required in optimizing the output (PERL)
in thread Help required in optimizing the output (PERL)
This uses bitstrings to represent the keywords contained within each para. This means:
My testcase looks for 599 words in 612 paragraphs and whittles it down to 34 paras in 16 seconds.
Naively projecting that, it should deal with 200 words in 600,000 paras in 1 1/2 hours?
It may make some sense to perform a second filtration to attempt to exclude further paras where a later inclusion covered the additions of an earlier inclusion. This might even be made iterative until no further exclusions can be made.
For my testcase this refinement made no difference, so I removed the code.
#! perl -slw use strict; ## Load the words list my @idx; { local( @ARGV ) = '845818.idx'; @idx = <>; }; chomp @idx; print "Words: ", ~~@idx; $/ = ''; ## Paragraph mode ## hash of bitstrings; keyed by para number; ## each bit represents one word in @idx my %paras; while( <> ) { $paras{ $. } //= ''; ## Initalise bitstring for my $i ( 0 .. $#idx ) { ## for each word index my $word = $idx[ $i ]; ## get the word if( m[\b\Q$word]i ) { ## If it is found in the paragraph ## Set the bit for this word in the bit string for this pa +ra vec( $paras{ $. }, $i, 1 ) = 1; } } } print "Paras: ", scalar keys %paras; ## Order keys (para numbers) by the bits set ## (words found) in bitstring (descending) my @parasByWordCount = sort { unpack( '%32b*', $paras{ $b } ) <=> unpack( '%32b*', $paras{ $a } +) } keys %paras; ## Initialise the minimal set with the para ## that contains the most words my @set = shift @parasByWordCount; ## And initalise the mask to its bitstring my $mask = $paras{ $set[ 0 ] }; ## While there are still paras to consider ## (considering them in descending order) while( @parasByWordCount ) { ## Get the next one to be considered my $next = shift @parasByWordCount; ## If adding (ORing) the next bitstring with the mask makes a diff +erence if( ( $mask | $paras{ $next } ) ne $mask ) { ## $next contains new paras not yet covered, ## so add it to the set and the mask push @set, $next; $mask |= $paras{ $next } } ## otherwise just discard it } ## Now we have a set of paras that covers all the words ## Is it minimal? For my artificial testcase it appears to be ## But I offer no proof. That's for others (probably to discount). printf "All the words are covered by %d paras:\n", scalar @set; printf "%3d : %s\n", $_, unpack 'b*', $paras{ $_ } for sort{ $a<=>$b } + @set; printf "All : %s\n", unpack 'b*', $mask; __END__ c:\test>845818 fannyhill.txt Words: 599 Paras: 612 All the words are covered by 34 paras: 62 : 1100001001100000100011100000111000000110010000000000000000000110 +110100000100000000000000000010110000001000100000000010000001100000000 +000000100010000000000100000001001000000010011000000000100010100000000 +111001000001011001000111100000001000110000100101100110000000110010000 +000000000010000010000000000000000000000001000000000101000001111000010 +100000011010000000101011000101101100000000000000100000010110001010000 +100000000101100001000000001000000100101000000010010000000100011100001 +001001000000000000100001000000100011000101001110000101000100011100000 +0010001000001000011101101110110100100001110100001000 82 : 1110000000001000101011100011011011010100000000100000011000000110 +110000000010000000000100000010000110101000111000100000100010000001000 +000000101011000000000100110000000001010000110000000000000000100010000 +011001110010011100100000001010000001101000110000000010000001011001100 +010000000010000000100000000100010000001001000000000000010000011101100 +001000101110000000110110011100011000010010000000000000000000010001000 +000100000001000000000000010000000000000100001001010000000000000100000 +000011001000000001100000000001000011010110010010010010010101100010000 +0001110000010000000101001100001101000001101000111000 139 : 1110000001101000100011111000011011010110000010000010011000100110 +100000000000100000001000000001000000000000000100000100000000010001000 +000000101000000000000000000011000001010000010011001100000010110000000 +011011100001011001000110001110000000101000100000000111011000011000000 +010000001100000010101000000010000000000001000000000001001001000000010 +000000001100100100100010100110010001000000000000000000010000001010000 +000000000000000000000000000000000101100000000000011000001010000100101 +000000001000000010000100000000100111111110010010000000001100110000000 +0000111000001010000111001100000100010000100000101100 197 : 1100000000000000100011100000011011000100010100001010000100100110 +100000010000000010000100000000010000001000100000000001000000000000011 +000000100000010000100000000110001001000000110001000000000010100000000 +011001101000011001000000011110001000101000100000000011001000111000000 +010000000000000010100000000111100000001001100000000000000000010100010 +000000001010110000110010111000001000010010000001010010010000001010000 +011010000000000001000000000110100101011000000000000010000000000000000 +000001010000000000000000000000100011011110000010000000001100011010001 +0101100000000000000001001100100101110000100100110000 248 : 1101000101100000100011111000011011000101001100000000000000000100 +000000001000000000000100110000100000000000000001000000010000001000011 +000000010010000000000000110000000001110100011000001000001100100100000 +011000000001011010000110101110101000100000110001001000000000000000010 +001000000001000100101100010000000011000001000000000000000011010000000 +000000001011000000110010100110000100011000001101000000000000001000000 +001000100000010000100000000001000100000000000000011000010000000100000 +000000000000000001000000011100000111000110001010000000010101011010000 +0000110000000000000100000000000100011000100000101001 253 : 1110000001100000001011100011011011000100111100000011010000000110 +001000000100000010000100000000000111101000100100000000000000000001011 +000000000000010010000000110001000001000010010000001100000000100010000 +011101100001111000000000001110001100100000100010000010000000010010000 +111100000100111001001010000000000000001001000000100011000001011000001 +000101001110100000110010000100000000010011000000000011010000001010010 +000110000000010010000000000100100000000100000010000000010000000100001 +010100001001000000100000000001101111000100101010011000110100011000000 +000100000010100000010000001110010101100010110 290 : 1100000001100001101011100000011011000100000000000000010000010010 +110000000000110000001110000010000001001110100000000001000000100000000 +000000010000000000000000100001000011000000010101001010101000100110000 +011001100000111000000110101111000000100000110010000010000000000000000 +000000100110000001001100000001000000000011000000100100001101110000000 +000000001011100000100010010000011000100000000010010000000001000001000 +000000000000000000001010000000100101000000010000011100000100001100100 +010000000010101010100010000100000111010000010010011000001100000010001 +0000110000010100000111000000010100110001100000101010 307 : 1100000000000000100011100011011011010110000010100001011000100111 +100000000010000000000000000000001000001000100000000001100001001001011 +010000101100010100000100000010000001000000010001000010000000100000000 +011100000000011000000000001110000000100000100001001011011001000010000 +010000001010100001111000010001100100000001001001100001001000010000010 +000001001010100000110011101000111000010010001100000000010000000000100 +000000000000000000000000000000100101000100000000110000101000000110101 +000010011000100000000001000000000011000101001000100000110110000000000 +0001100000010000011111101100000111011000111000111001 326 : 1101000010000010100011100000100011000100001000000000110000000110 +000000000000000010000100000000001001000000000000010000000010000001000 +000100100000010100001000000000000001001000110000000000000010000001010 +011001110011011000010111001010000000100000110000000011000000000000000 +001000000010101000001000000001000000000011000000000001000001010001010 +000000001110000000100010100100000001000000010011000000110000010000000 +110010000001100010010000000000100000000100000000011000010000000100000 +000000000000110010010000011110000001110000100010000001000100010000000 +0000000000010000100101001100000101010010100000101100 335 : 1100000000001000100011110011111011011100001100100010000000010110 +001000000110100100000100000000111000001000100000010000100011100000011 +001100001000100010110000111000001001100000011011000100001000110001000 +011001000010111101001000011111001100101000110001001111110111011000000 +010110011000101011110000000001011000001001001000100101011000011100011 +000000001111100000110010110000101000010110000000000100110110011000000 +001000000111010001100010000100001000000001101000010000011000001111001 +000000001100000000000001011100100111110111110010110000100100011100000 +0001111100100100011111111100100111010011111000010000 383 : 1111010000000010100011111100011011001100000100001000001001100110 +100000000100100100000110000000000000001000100000000000000010100100011 +000000010010010010000010000000000011100000010000000000001100110010000 +011101100001011011000000001110011000110001100101000100000100000010000 +001000101000101001000000100001000000000001000010001001010000011000000 +000100001110000000100010110100101000000000001000000000000000000000010 +000000100000000000000000000101000100000000000000000000100000000100100 +000000001000000000000000000001101101110010100010100001111110011001000 +0001000000000000000101101110100100011100101000011000 384 : 1100010001100010100011111000011011000100110000000000010000000111 +100000000000000000100101000000000000001010000100000000000010000001000 +000011001000000000001001001001000011000000010000000000000000110000000 +011000000001111100000000001010011000101000100100001010000000000000010 +000000100000010000101000000000000010000000000000000001000001010001000 +000110011110000001100111100101110000010010000010000000000000001010000 +001000000000000000001000000000000111000000000000010000010000000100100 +000001000010000010100000010100000111010110001110111001110110000000000 +0000111000000000000101011110100101010000101010001100 385 : 1100000000000000110111110100011011001100010100000010010100000110 +100000000110000010100100000000000100011000100100000000010000000001000 +000111000010000010110100000011001001000010010100000100000001100010000 +011001000000011000101000001110001101101000110011001111000100111010000 +111000001010000001000000000001101010000001000000001000000001010000010 +010011001010100000110011111000110001010010000000000010010100011000000 +011000100000000000100000000000000000000100001010000000001000000100100 +000000000001000000000100000100000111010110010010011000010101011110000 +0001111101000001010111001110010101000000101000111100 392 : 1101000001100000100011101111011011111100001010100010000000000110 +110010000000000001000101000100001110011001000000000000000000000100011 +001000110011100011100000001011001001001010110001100010000000000001000 +011100000001011100000000001010000000100000100010000011000000100000000 +011000010000000000101000000001001010000001000000000010000000010111010 +001001001011110010110011100100011100010000001100010100000000010000000 +000000000000000000000110100000000110000000101010000000010100010110110 +000000000001001000000000000000000111010110110010100000001100011010100 +0000111000010000000111100000110101110010101000101001 393 : 1100001011111000110011100011011011010100001100101000100000000110 +100010001000100010100100000001000111001110110000000010100000100001010 +000010100010000011000110110000000101010010010000001010001110000000000 +011001000001011011100000001010001000100000100001011110001000010100001 +010100000000101001001010000000001010000001110001111000010001111000010 +000100001110110001110010011100011001010011000010000000000101010000000 +000110001100000000000010001000100110100110100000000000000101000110100 +000000001000101011100001011001000011110111001010011000011101011000001 +1001110000010100000111110010110100000000100000101101 395 : 1110000001101100100011100011111011000110000100101001000010000110 +001100001000000000001100000000000100001010100001001000000010100001010 +000010100000100000001110110011010001010000010000001000000001110101001 +111001101011111011000000001010010000110001100000011011001001010100100 +001010101100110010111000010000001010001001111111000011010000011100001 +000001001011110100100010111101011101010111000001010001110001011000000 +001010100001010000100100010000100110011101000100000000000100100110111 +000000011000110110000011000000110111010110001110000000011100111100010 +0000110000010000000101001110100101010000101100011100 398 : 1100000001100000100111101000011011000100001100100100000000001111 +001000001000101000110110000000000101001000100000000000000000000000000 +000000100000010000100000000001000011000010000000000000101100100010000 +111000000000011000000111001111000000100010100000001011000101100010000 +000000101000100001000000100101101000011011110000101001000001010001010 +000001001010100000101111000100111000010011000000010110011101100000000 +101001000000000000000000000000100100000000001000010000010011000110001 +000001000000110000000000011000000111010000001110000000000101011100010 +0001100000000000100111001110110110010100101100011000 404 : 1100000001110010100011110000011011011100011100101111010010001111 +000011000000010100000101000111011111101000100001000000000111000001011 +000110100010100011111100110001000001100000011001101011101000101101000 +011001110001011111000110001011001110100101110011101111011000011110100 +110000001100000010101000000101001010000001001011110011011001011101111 +000000011011110001100011111101011111010000001101000010110100101000000 +001000010000000011101110001000100101001110000001000010001100000110101 +001000101000000010001000011000101111110111011010111011110101111101001 +0001111000111100011101101110110101011001101000111101 415 : 1110000000000001100011100000011011000100000000001010001001000111 +001000000110000100001100000000111000001000000100101010000001000000011 +100000100000010100110000001001001001010000011001000000100010100100000 +110101000001111101000110111110000000100010100001111011011111011000000 +101001011000011000000000011101000000000001001000100000010000000000000 +000010001011100100100011101100111001010000000000000000110000001110000 +000100001010000001000001000000000100100000000000011111101011000110001 +000100000010000001000011000010000111110100110000000010011100000010101 +101011100010101000010100111011010101000011100 421 : 1110100000001100100011100011011011000101000001110010000001000111 +000000000010000000000100000011011110001000100000111000100001100000011 +000000101100000010000100000110000001010000010001101100111000000000000 +111001000000011101000110101111000010101000100000011011111001010000001 +010101011000100011100000010111000000001001101001100101001000000001111 +000000001010100000110010101001111100000010000000001100010100000000001 +000010000001100011000000000101000000000000000000011000001110100110000 +000110011010110000000001011000100111100110110010000011011110011110100 +0010111000000000000101001100100101111010100100101001 424 : 1110100001101000100011100000011011001101100010100010011110010111 +001100000010010000001100000001001001001000100110100001100011000001011 +000110111110010000000100001001101011011000011001001100001001000001000 +011001101000011100000110011111000000111000110001001111010101011000000 +010011011001110001100000000101000100000001000001100011000001000101010 +111001001011100000111010100100011000110110001101001101110000001101000 +000000101000000111011111000110000100101000001000011010001100001110101 +110010000010100010000011100000100111110110110010111001111110011100001 +0100111000001100010101111100110101110010101000101001 433 : 1100100000001100100011100100011011000110001110101100001000000111 +100100000010000000001000000001011101101001100000000000100010100101011 +000110110010110000110110110010000101000010010001110000000001000001101 +011100000001011100100000001011001100100001110001011011011000011010100 +010000000000101000000001000001000000001001000001000001011001010100000 +001000011010100000110011100100111000010010010000001110011100011000001 +000101100101000000011010000111110000000100100000111101000100000100001 +101000011010000010000001000010100111100101110010100010110110010110100 +0001111000101000011101101110111100011000111000010000 435 : 1110001001100000000011111000011011100100110110000000000000100101 +000011100001100100000101000000000000000000001000000000000010110011011 +010000000000000000001000000000000001001010010000101011000000100010000 +011000000001011000000110000010000000100100100000000011001100000010000 +000000100100000000000000000001000010000001000000000000100011010001100 +000000011010100101110010100110011010111010000001010010111000001000100 +011000010001001001000000000000000100001000000000110000110000000100000 +000000000000000010000001000000100011000010001000000000010100000001000 +0000101000010000000101010011010100010100101000100000 442 : 1101000001100011100011100100111111000100001010000100000000100111 +000011000000100011100100000010000001000000001000000001000000000011000 +000000100000010000000001000001000011000010010001001000100010100010000 +011000000001111000000000001111000000100000110100001110000110000000001 +000000010000100101000000000001001010100001000000100111010001011011000 +000110101111101001111011101101110100000011000001010000000000000000010 +001001100000000001000000001000000100001000111000000000101010000100000 +010000001011000000000000000000000011110010100110111000010101010001010 +0001110000010000100111011110101101110011101100101100 448 : 1101000001100001100011100000111011000100000100000100000000011110 +001000100001010001000100000000000000000000000000000001100010100101011 +000000000000011000000000001110000001000011010000000011000000100101000 +011000000000011000000111001110001100100000100100001110000001000100001 +001000001010000000100000000001000011000001000010100000000011010001110 +000010001011100000100111100101111001000011000001011000010100000000100 +010000001000001000000000010000000000000000001000010000000000000100000 +010000000000000010001000000010000111000111001000111100001100011000000 +000111000001101100011101110010010101000010110 456 : 1110000101101100110011101000111011010110011110101001011100000111 +000011101101000011000100111011010001001110000000001111000010010001011 +000000001110100000111100110011000011010000011111000010101010100101000 +011011001001011001010110011110001101101000110101011010001101011010000 +011000001110001001100000000011011010010001100010110001111011011100010 +000011101010100100110011111100011101010010101100000101010100011011101 +001000000001001101100100101110100111100001010000011010010110100110111 +100001010000000110100100000010101111110110011110111011101100110110010 +0000111110000100010111101110111111011000111010110000 458 : 1110000001101000100011101000111011000110001010100000100000000110 +001011000100000000000101100001000001001100100000101101000010010000011 +101000000000000010000000000000000001000000010000001010000010100111010 +011010000000111100000000001111001101111000100010000010011001010000001 +010000001011001001110000000001110110000001001000101100010000011000000 +000010001011100010100011010101011001000000100000001000010100001010010 +001000011000000011000100010001000000000111001000010010011000000100001 +000000001100100010100010000000101011010100011010011000011100011010001 +0100110000101100000111101110100111010000110001010000 467 : 1100000010001101100011101011011011011100011001101010110100010110 +110010000000011000000001000010010111001010101100001000000010000001110 +000000110011100000111001001010000001010010010001000000001010110010000 +011001110001011100001111001111001100101000110000001111000000010010100 +010001000000101101101000000001010010001001101000100001000000011100000 +001000101010100001100011100101111100010110101110001000100100010001000 +010000000000010001111010001000100100000000100000011010111000000110000 +000111011000000010100001000010101111010101001010111110011101111010001 +1011100000011001000111011110110101011001101001001100 472 : 1110000100000010000011111100011011000100000000100000000000001110 +000000000001100000010001110010100100000000000000000100001000110001000 +000000000000010100000010000000000001000000011000111000000000000001001 +011001110001011101000110001110010000100000110000000110001100000000000 +000000000100100001001000001101000000000001000000000001000110010000000 +000000001010100000111110110101000010010100000001000000010100000000000 +000000000000000000000000000100000100010000000000010000000000000100000 +010100001001000010000000000010101011100110100011110001000100000011001 +0001001110000000000111001110110100010100100000111100 499 : 1100000000001100100011100000011011001110101110000000101001010110 +000011000000000010000100000010001100111000100000001000000010010000010 +011010001100010000100001001011000001000000010010000100001000100110000 +011101101010011100100111001010001100100000111001011111001101000100000 +000000010010101000000000000011000011000011000000000001010101010100011 +001001001010110000100010010100111101010110000000000000010100011010000 +000100000010000001000010000111100110000111001000000000010110100110101 +000110001011000010100010000000000111010111010010100010100100011100001 +0100100000100010000111001110111101011000101001011000 506 : 1110001001101101101011101000111011000101011110100010001000110110 +001110000001000001000101100000000001011101111010000000001010010101010 +010110111100010010001001111001100101010000010000001100101000100111000 +011101110001111100000111101111001100111001100011101010000111000000001 +000101101100110001000000001101101100000001100000000001010011010001110 +000000011011100000110111101100010011010010001100010101000111001100000 +011010000000001000011000000001000100010000001010010000001100010110100 +000000001011000110100100010010000111110000101110111101101110011000001 +0111110000010010011111111110111111011000100100100000 517 : 1100000000001111100011101100011011000100001000100100000000000111 +000000100001001001000100000100000000001000000000000000100000000100000 +000000000000000000000000001000000001000000010000100000000000000001000 +011000001101011000000110000001001100100000100000010011100000100010000 +001000000100000011000000000000000011000001000000100001010010000010000 +000000001010001000100010110101111100010011010000000000000000001000000 +101000000000001010000000011000100101000000000010010000010000000100110 +000000000000000000000000010000000111110011101010010000001101000000010 +0001100000010000000101001111100100010110100100110000 525 : 1100000000000000000011100000011000001100001000100000010001000100 +000000000000001010000100000000000000011100001000000000000000000001000 +000000001000000000000100001100001001000000011000110000001000100010000 +011000000001011101011110101010001100100000110000000111010000000001000 +000000001000000001100000000100000110000101000000101001010001010001010 +000000001010100000100010100001110000010100000011000100100001000000001 +000000001100000000000010000000000100100000000001010001100001101100110 +000010001000000010000000000000000111000100110000100001000100010100100 +0001101110010000000101101100010100111000111000001001 557 : 1100000100000000000011101000011011001100000000100000000000000110 +111000100001000000000001000000000000001000101000000000010000000100000 +000000010000000000000000000000100011100100110000001000001000111000000 +011000000000011100000111001010001100101100100011000011000000000010000 +000010010101101000000001000000100000000001000001000000100011010001100 +000011001010101011110010110100111000000000001100001000110000000001000 +010000000000000000100000001000100100100000000000000000000000000100000 +000000000000000011000000000000000101000000000010000000000100000010101 +0101001000001010010111111100111100010110110100001100 All : 1111111111111111111111111111111111111111111111111111111111111111 +111111111111111111111111111111111111111111111111111111111111111111111 +111111111111111111111111111111111111111111111111111111111111111111111 +111111111111111111111111111111111111111111111111111111111111111111111 +111111111111111111111111111111111111111111111111111111111111111111111 +111111111111111111111111111111111111111111111111111111111111111111111 +111111111111111111111111111111111111111111111111111111111111111111111 +111111111111111111111111111111111111111111111111111111111111111111111 +1111111111111111111111111111111111111111111111111111
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^5: Help required in optimizing the output (PERL)
by choroba (Cardinal) on Jun 22, 2010 at 23:04 UTC | |
by BrowserUk (Patriarch) on Jun 23, 2010 at 00:52 UTC |