in reply to Help required in optimizing the output (PERL)

It should be possible to point this first-time poster in the right direction without actually writing the code. Let's see:

Let's assume the text to search is in a text file. Since randomid is treating the text as paragraphs, typing "perldoc perlvar" and reading about $INPUT_RECORD_SEPARATOR aka $/ will be instructive. Especially what happens when you set it to "\n\n".

And then read about $INPUT_LINE_NUMBER aka $. to see how to easily get the paragraph number.
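
Just to show the shape of the mechanics those two perlvar entries describe (this is not the OP's program, and the filename is invented):

    use strict;
    use warnings;

    $/ = "";    # paragraph mode; perlvar contrasts this with setting "\n\n"

    open my $fh, '<', 'text.txt' or die "Can't open text.txt: $!";
    while ( my $para = <$fh> ) {
        chomp $para;    # strips the trailing blank-line separator
        # With $/ set this way, $. counts paragraphs rather than lines.
        print "paragraph $. is ", length( $para ), " characters long\n";
    }
    close $fh;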

Then read something like "searching for keywords" to get an idea of how to search for various keywords in text. Now getting all the useful info out of that thread will take a little more than a simple copy and paste, but one will learn a little in the process.
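
And the keyword side might look something like this, with invented keywords and sample text:

    use strict;
    use warnings;

    my @keywords = qw( apple banana cherry );   # invented keywords
    my $re = join '|', map quotemeta, @keywords;

    # Invented sample text, one paragraph per element (in real use these
    # would come from reading the file with $/ set as above).
    my @paras = (
        "The apple fell a long way from the tree.",
        "Someone left a banana and a cherry on the table.",
    );

    for my $n ( 0 .. $#paras ) {
        my %seen;
        $seen{ lc $1 } = 1 while $paras[$n] =~ /\b($re)\b/gi;
        printf "paragraph %d contains: %s\n", $n + 1, join ' ', sort keys %seen
            if %seen;
    }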


Re^2: Help required in optimizing the output (PERL)
by BrowserUk (Patriarch) on Jun 22, 2010 at 03:29 UTC

    I suspect that the real challenge here is not so much discovering which paragraphs contain which keywords--though paragraph mode will certainly simplify the process.

    But rather, having discovered which paras contain which words, and stored that information in some form, how to determine the minimum set of paras that contain all the keywords.

    And given the form in which the OP presented the question, I suspect that s/he is actually stuck on how to even begin approaching that task:

    1. How to store the information about which words appear in which para.
    2. How to find the minimum set of paras that, between them, include all the keywords.

    It is an interesting problem that has many possible solutions, but only a few that will be relatively efficient.
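
    For point 1 above, a minimal sketch of one possible representation (the keywords, paragraphs and hash names here are invented):

        use strict;
        use warnings;

        my @keywords = qw( alpha beta gamma );      # invented keywords
        my @paras    = (                            # invented paragraphs
            'alpha and beta show up in this one',
            'gamma appears here, together with beta',
        );

        my ( %words_in_para, %paras_with_word );    # the two directions of lookup
        for my $n ( 1 .. @paras ) {
            for my $word (@keywords) {
                next unless $paras[ $n - 1 ] =~ /\b\Q$word\E\b/i;
                $words_in_para{$n}{$word}   = 1;    # para number => its keywords
                $paras_with_word{$word}{$n} = 1;    # keyword => paras holding it
            }
        }

        print "para $_ holds: @{[ sort keys %{ $words_in_para{$_} } ]}\n"
            for sort { $a <=> $b } keys %words_in_para;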


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      We can consider the words file to contain each word on a separate line, and the other file to contain each paragraph on a single line. Each paragraph starts with a > symbol, so the occurrence of > can be taken as the start of a new para. I already have hashes storing both pieces of information: a hash recording which words are contained in each paragraph, and another hash recording which paragraphs contain which words. The only challenge is to find the minimum set of paras that includes all the keywords. Help is greatly appreciated.
        a hash recording which words are contained in each paragraph, and another hash recording which paragraphs contain which words.

        As worded, both those hashes contain the same information? Please clarify.

        Also, how many paras (not lines, or GB)? And how many (key)words?

        Finding the optimal solution is NP-hard--brute force is essentially the only way. There are some efficient mechanisms for testing permutations, and others that can help prune the search space, but we need to know the scale of the parameters.
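
        For what it's worth, the usual cheap approximation is a greedy pass that repeatedly picks the para adding the most still-uncovered keywords. It won't find the guaranteed minimum (that's the NP-hard part), but it gives a baseline to prune against. A throwaway sketch with invented data:

            use strict;
            use warnings;

            # Invented data: para number => { keyword => 1 } for the keywords it holds.
            my %words_in_para = (
                1 => { alpha => 1, beta  => 1 },
                2 => { beta  => 1 },
                3 => { gamma => 1, delta => 1 },
                4 => { alpha => 1, gamma => 1, delta => 1 },
            );
            my %wanted = map { $_ => 1 } qw( alpha beta gamma delta );

            my @chosen;
            while (%wanted) {
                # Pick the para that covers the most still-uncovered keywords.
                my ( $best, $gain ) = ( undef, 0 );
                for my $para ( sort { $a <=> $b } keys %words_in_para ) {
                    my $new = grep { $wanted{$_} } keys %{ $words_in_para{$para} };
                    ( $best, $gain ) = ( $para, $new ) if $new > $gain;
                }
                last unless defined $best;              # some keyword occurs in no para
                push @chosen, $best;
                delete @wanted{ keys %{ $words_in_para{$best} } };
                delete $words_in_para{$best};
            }
            print "cover: @chosen\n";                   # here: para 4, then para 1

        With that baseline in hand, an exhaustive search only needs to consider covers smaller than the greedy one.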


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        This uses bitstrings to represent the keywords contained within each para. This means:

        • Storing the information about which of the 600,000 paras contain which of the 200 words requires 600,000 x (200/8) bytes = 15 MB of bitstrings, plus hashing overheads, ~= 150 MB in total. Doable on most systems.
        • Performing union, intersection/symmetric difference etc. set operations is very fast.

          My testcase looks for 599 words in 612 paragraphs and whittles it down to 34 paras in 16 seconds.

          Naively projecting that (the cost scaling roughly with paras x words), it should deal with 200 words in 600,000 paras in about 1 1/2 hours?

        It may make sense to perform a second filtering pass that excludes paras whose additions were entirely covered by later inclusions. This might even be made iterative until no further exclusions can be made.

        For my testcase this refinement made no difference, so I removed the code.
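
        For anyone following along, the bitstring representation might look roughly like this. It is a sketch of the idea only, not the testcase code; the keywords and paras are invented. vec() sets individual bits and unpack '%32b*' counts them:

            use strict;
            use warnings;

            my @keywords = qw( alpha beta gamma delta );    # invented
            my @paras    = (
                'alpha and beta appear here',
                'only gamma in this one',
                'beta, gamma and delta together',
            );

            # One bitstring per para: bit $k is set if the para contains $keywords[$k].
            my %bits;
            for my $p ( 0 .. $#paras ) {
                my $bs = "\0" x int( ( @keywords + 7 ) / 8 );
                for my $k ( 0 .. $#keywords ) {
                    vec( $bs, $k, 1 ) = 1 if $paras[$p] =~ /\b\Q$keywords[$k]\E\b/i;
                }
                $bits{$p} = $bs;
            }

            # Set operations are plain bitwise string ops on those bitstrings,
            # and unpack '%32b*' counts the set bits.
            my $union  = $bits{0} | $bits{2};    # keywords in either para
            my $common = $bits{0} & $bits{2};    # keywords in both paras
            my $added  = $bits{2} & ~$bits{0};   # what para 2 adds over para 0
            printf "union=%d common=%d added=%d\n",
                map { unpack '%32b*', $_ } $union, $common, $added;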


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.