in reply to Help required in optimizing the output (PERL)

Depending on how many paragraphs you have, brute force may work:

If the number of paragraphs is very large, the number of combinations may exceed your processing capabilities. Like I said, brute force :-)

Alternate solution

Does this reduce to a form of the spanning tree problem? Perhaps if you would arrange set of graphs with two types of nodes (paragraphs - P, and words - W) where W are adjacent only to the containing P, and P is adjacent only to the W it contains, that you could

The sum of those depths (and the associated P nodes on those paths) should be your ideal list of paragraphs.

I think (still have not had coffee this AM).

Update: Given the updated information on the number of words and paragraphs, spanning tree may not be the best solution. Perhaps I will play with this and see if I can find the time to drop what is in my head down on paper.

Update 2: This algorithm is not suitable for large quantities of P from a big-O standpoint. If I have it right, it comes down to something along the lines of O(W * 2^P). The first pass on each connected graph would need to try all combinations of P, with a maximum depth of W. Perhaps the upper bound can be reduced, but not at my current coffee levels.

P.S. I have had problems this complex for homework. Perhaps not 100 level homework, but these problems do exist in upper level courses, and there are people who try to fake their way even at that level. So please, excuse some cynicism and a lack of desire from some to provide a full solution.

--MidLifeXis

  • Comment on Re: Help required in optimizing the output (PERL)

Replies are listed 'Best First'.
Re^2: Help required in optimizing the output (PERL)
by choroba (Cardinal) on Jun 22, 2010 at 15:53 UTC
    I have a different idea, not fully tested and elaborated yet. For each paragraph, hash all the keywords contained in it. Then sort the paragraphs according to the number of keywords they contain. For each paragraph starting from the "richest" one, check whether all its keywords are present in the remaining paragraphs. If yes, remove the paragraph.
      This may not completely work.. As there are about 600000 paragraphs and trying to eliminate it that way may not be very efficient. ( I havent tested it) And about 200 keywords. ( these are the maximum count we are considering ) Will look if brute force or spanning tree can be used. Thanks all for the help and also i understand the homework part however my comment was posted only for the part where somebody without knowing or assuming made some unnecessary comments.
        The brute force method mentioned above mentioned "for each combination of paragraphs".

        You mention 600000 paragraphs.

        The thought of 2^600000 iterations would definitely have me looking at the Spanning Tree solution.

        There are about 600000 paragraphs and about 200 words One hash contains the words and the paragraphs occurring as the values And the other hash for each paragraphs the words contained in them.