in reply to Re: Help required in optimizing the output (PERL)
in thread Help required in optimizing the output (PERL)

I suspect that the real challenge here is not so much discovering which paragraphs contain which keywords--though paragraph mode will certainly simplify that process--but rather, having discovered which paras contain which words and stored that information in some form, determining the minimum set of paras that between them contain all the keywords.

And given the form in which the OP presented the question, I suspect that s/he is actually stuck on how to even begin approaching two sub-tasks:

  1. How to store the information about which words appear in which para.
  2. How to find the minimum set of paras that include all the keywords.
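The first sub-task might be tackled along these lines--a hedged sketch only: the keyword list, the in-memory input, and the blank-line paragraph separator are illustrative assumptions, not the OP's data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @keywords   = qw( apple banana cherry );          # stand-in keyword list
my %is_keyword = map { $_ => 1 } @keywords;

# In-memory stand-in for the paragraphs file
my $text = "apple pie recipe\n\nbanana split with cherry\n\napple and cherry tart\n";
open my $fh, '<', \$text or die $!;

my %words_in_para;    # para number => { keyword => 1, ... }
my %paras_with_word;  # keyword     => { para number => 1, ... }

local $/ = '';        # paragraph mode: read blank-line-separated chunks
my $paraNo = 0;
while ( my $para = <$fh> ) {
    ++$paraNo;
    for my $word ( map { lc } $para =~ /(\w+)/g ) {
        next unless $is_keyword{ $word };
        $words_in_para{ $paraNo }{ $word }   = 1;
        $paras_with_word{ $word }{ $paraNo } = 1;
    }
}
close $fh;

print "apple appears in paras: @{[ sort keys %{ $paras_with_word{apple} } ]}\n";
```

Keeping both hashes costs little extra and makes the second sub-task easier to attack from either direction.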

It is an interesting problem that has many possible solutions, but only a few that will be relatively efficient.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^3: Help required in optimizing the output (PERL)
by randomid (Initiate) on Jun 22, 2010 at 15:12 UTC
    We can consider the words file to contain one word per line, and each paragraph in the other file to be a single record. Each paragraph starts with a > symbol, so we can treat each occurrence of > as the start of a new para. I already have hashes which store both pieces of information: a hash containing which words are contained in a paragraph, and another hash containing which paragraphs contain which words. The only challenge is to find the minimum set of paras that includes all the keywords. Help is greatly appreciated.
      a hash containing which words are contained in a paragraph, and another hash containing which paragraphs contain which words.

      As worded, both those hashes contain the same information? Please clarify.

      Also, how many paras (not lines, or GB)? And how many (key)words?

      Finding the optimal solution is the minimum set-cover problem, which is NP-hard--brute force is essentially the only way to guarantee the optimum. There are some efficient mechanisms for testing the combinations, and others that can help prune the search space; but we need to know the scale of the parameters.
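Where the optimum isn't strictly required, the classic greedy set-cover heuristic--repeatedly pick the para covering the most still-uncovered keywords--is a common practical compromise. A sketch on made-up data (my illustration, not the poster's approach; greedy is not guaranteed optimal):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# para => set of keywords it contains (illustrative data)
my %words_in_para = (
    1 => { map { $_ => 1 } qw( a b )     },
    2 => { map { $_ => 1 } qw( b c d )   },
    3 => { map { $_ => 1 } qw( a c d e ) },
    4 => { map { $_ => 1 } qw( e )       },
);
my %uncovered = map { $_ => 1 } map { keys %$_ } values %words_in_para;

my @chosen;
while ( %uncovered ) {
    # pick the para that covers the most still-uncovered keywords
    my ( $best, $gain ) = ( undef, 0 );
    for my $para ( sort { $a <=> $b } keys %words_in_para ) {
        my $n = grep { $uncovered{$_} } keys %{ $words_in_para{$para} };
        ( $best, $gain ) = ( $para, $n ) if $n > $gain;
    }
    last unless defined $best;    # some keyword appears in no para
    push @chosen, $best;
    delete @uncovered{ keys %{ $words_in_para{$best} } };
    delete $words_in_para{$best};
}
print "@chosen\n";    # para 3 first (4 new keywords), then para 1 for 'b'
```

Greedy runs in time roughly proportional to paras × keywords per pass, which matters once the para count reaches the hundreds of thousands.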



      This approach uses bitstrings to represent the keywords contained within each para. That means:

      • Storing the information about which of the 600,000 paras contain which of the 200 words requires 600,000 x (200/8) bytes = 15 MB of raw bitstrings, which with hashing overheads comes to roughly 150 MB. Doable on most systems.
      • Performing union, intersection/symmetric difference etc. set operations is very fast.

        My testcase looks for 599 words in 612 paragraphs and whittles it down to 34 paras in 16 seconds.

        Naively projecting that, it should deal with 200 words in 600,000 paras in 1 1/2 hours?
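The bitstring representation can be sketched with Perl's vec and its bitwise string operators; the keyword list and bit indices below are illustrative assumptions:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One bit per keyword, one bitstring per paragraph
my @keywords = qw( alpha beta gamma delta );
my %bit      = map { $keywords[$_] => $_ } 0 .. $#keywords;

sub para_bits {
    my $bits = '';
    vec( $bits, $bit{$_}, 1 ) = 1 for @_;    # set one bit per keyword present
    return $bits;
}

my $p1 = para_bits( qw( alpha beta ) );
my $p2 = para_bits( qw( beta gamma delta ) );

# Set operations become C-speed bitwise ops on whole strings:
my $union     = $p1 | $p2;                   # keywords covered by either para
my $intersect = $p1 & $p2;                   # keywords in both
my $covered   = unpack '%32b*', $union;      # population count of set bits

print "union covers $covered keywords\n";    # all 4 here
```

With 200 keywords each paragraph's bitstring is only 25 bytes, and a union over thousands of paras is a tight loop of string ORs.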

      It may make sense to perform a second filtration pass that excludes paras whose unique contribution at the time they were added was later covered by a subsequent inclusion. This pass could even be iterated until no further exclusions are possible.

      For my testcase this refinement made no difference, so I removed the code.
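Since that code was removed, here is my own reconstruction of what such a pruning pass might look like--an assumption about the refinement described above, using one-byte bitstrings for brevity:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# para => bitstring of the keywords it contains (illustrative data)
my %bits = (
    1 => "\x03",    # bits 0,1
    2 => "\x0E",    # bits 1,2,3
    3 => "\x0F",    # bits 0..3 (makes paras 1 and 2 redundant)
);
my @chosen = ( 1, 2, 3 );

# Drop any chosen para whose keywords are entirely covered by the union of
# the remaining chosen paras; repeat until nothing more can be dropped.
my $changed = 1;
while ( $changed ) {
    $changed = 0;
    for my $i ( 0 .. $#chosen ) {
        my $others = "\x00";
        $others |= $bits{ $chosen[$_] } for grep { $_ != $i } 0 .. $#chosen;
        if ( ( $bits{ $chosen[$i] } & ~$others ) !~ /[^\x00]/ ) {
            splice @chosen, $i, 1;    # para i adds nothing the others lack
            $changed = 1;
            last;                     # restart scan after modifying the list
        }
    }
}
print "@chosen\n";    # only para 3 survives
```

Whether this pass pays for itself depends on how much overlap the first pass leaves behind; as noted above, on some test data it removes nothing.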


        My script takes 25 secs on an old HP notebook, or 10 secs on a Mac Mini, for 600 words and 612 paragraphs. I don't know how it scales yet, though.