randomid has asked for the wisdom of the Perl Monks concerning the following question:

Code / idea for this in PERL is greatly appreciated.

There are two files: one containing GRE words and another containing some paragraphs with GRE words in them. We need to find an optimal method to display the output. Example as discussed:

Para 1

Memorizing vocabulary lists is aberrant abscond advocate not the best way to improve your vocabulary. The best way is to read more and read more challenging material. However, if you only have a couple of months till you take the test, memorization might be your only realistic option. I hope you are seeing this page in your first, rather than your last year of college. If so, consider taking courses in a broad range of disciplines. Doing so will prepare you for the wide range of reading comprehension topics you will get on the GRE, GMAT, MCAT, and LSAT, and improve your vocabulary as well. Include a course in logic for good measure (MSU students can take PHL 130 -- a basic logic course). However, if you're already a senior, you might want to learn all the words on this list. They are not all that uncommon, and you might run into them elsewhere than on the GRE.

Para 2

At least two sections of the GRE, abscond, aberrant, analogies and antonyms, depend largely on the test-taker's knowledge of vocabulary. Many of the following words have appeared in the GRE exams (based upon previous exams which ETS has released for students to practice with). Others are, in my opinion, good candidates to appear on the GRE. I can't guarantee that you will see all (or any) of these words on the GRE. No one can (legally) do that, as the GRE is secret. However, because of the difficulty of coming up with good antonym and analogy questions, it is likely that some words will be "recycled".

Para 3

The first column contains the vocabulary words, aberrant, arranged in alphabetical order. The second column gives the part(s) of speech (noun, verb, adjective, or adverb) to which the word belongs, and the third gives a brief definition. In some cases, the third column also includes examples of other forms of the listed word. Knowing the part of speech to which a word belongs can often help you analyze questions and answer choices on the verbal sections of the GRE and improve your chances of figuring out the correct answer. This is especially true for analogy questions. For more information on this subject, see the Suffixes page.

Para 4

As for learning the parts of speech, aggrandize rather than memorizing what part of speech each word belongs to, try to become more aware of what the most common parts of speech are and how they are used in sentences. For the purposes of the GRE, nouns, verbs, and adjectives are most useful. Consult a basic grammar handbook for explanations. Then, try to learn the vocabulary by putting the words into sentences. This is the best way to become more aware of how the words are used and will help you analyze GRE Analogies questions.

aberrant appears in Para 1, Para 2 and in Para 3

abscond appears in Para 1, Para 2

advocate appears in Para 1

aggrandize appears in Para 4

Here, if we did not have Para 2 and Para 3, all four words would still appear.

So we need to optimize it and give the output as:

GRE words appear in Para 1, Para 4


Replies are listed 'Best First'.
Re: Help required in optimizing the output (PERL)
by Yary (Pilgrim) on Jun 22, 2010 at 01:43 UTC
    It should be possible to point this first-time poster in the right direction without actually writing the code. Let's see.

    Let's assume the text to search is in a text file. Since randomid is treating the text as paragraphs, typing "perldoc perlvar" and reading about $INPUT_RECORD_SEPARATOR aka $/ will be instructive, especially what happens when you set it to "\n\n".

    And then read about $INPUT_LINE_NUMBER aka $. to see how to easily get the paragraph number.

    Then read something like searching for keywords to get an idea of how to search for various keywords in text. Now getting all the useful info out of that thread will take a little more than a simple copy and paste, but one will learn a little in the process.
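    Putting those two variables together, a minimal sketch (reading from an in-memory string that stands in for the real paragraph file):

```perl
use strict;
use warnings;

# Sketch: read a file paragraph-at-a-time. Setting $/ to "" enables
# "paragraph mode" (one or more blank lines end a record), and $. then
# counts paragraphs instead of lines. The in-memory string below stands
# in for the real file of paragraphs.
my $text = "first para line one\nline two\n\nsecond para\n\nthird para\n";
open my $fh, '<', \$text or die "open: $!";
local $/ = "";                       # paragraph mode
my @paras;
while ( my $para = <$fh> ) {
    push @paras, $para;              # $. holds the paragraph number here
}
close $fh;
print scalar(@paras), " paragraphs read\n";   # prints "3 paragraphs read"
```

    Note the difference between $/ = "\n\n" (a literal two-newline terminator) and $/ = "" (true paragraph mode, which collapses runs of blank lines); perldoc perlvar covers both.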

      I suspect that the real challenge here is not so much discovering which paragraphs contain which keywords (though paragraph mode will certainly simplify that process).

      Rather it is this: having discovered and recorded, in some form, which paras contain which words, how do you determine the minimum set of paras that contains all the keywords?

      And given the form in which the OP presented the question, I suspect that s/he is actually stuck for how to even begin approaching that task.

      1. How to store the information about which words appear in which para.
      2. How to find the minimum set of paras that include all the keywords.

      It is an interesting problem that has many possible solutions, but only a few that will be relatively efficient.
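      For sub-problem 2, note that this is the classic minimum set cover problem, which is NP-hard in general; one common approach is the greedy heuristic (repeatedly take the paragraph covering the most still-uncovered keywords), which gives a good, though not guaranteed minimum, answer. A sketch using the OP's example data stored as a hash of arrays (sub-problem 1):

```perl
use strict;
use warnings;

# Greedy set-cover sketch. %words_in is the sub-problem-1 structure:
# paragraph number => list of keywords found in it (OP's example data).
my %words_in = (
    1 => [qw(aberrant abscond advocate)],
    2 => [qw(abscond aberrant)],
    3 => [qw(aberrant)],
    4 => [qw(aggrandize)],
);

# Everything we still need to cover.
my %uncovered = map { $_ => 1 } map { @$_ } values %words_in;
my @chosen;

while (%uncovered) {
    # Pick the paragraph that covers the most still-uncovered keywords.
    my ( $best, $best_gain ) = ( undef, 0 );
    for my $para ( sort { $a <=> $b } keys %words_in ) {
        my $gain = grep { $uncovered{$_} } @{ $words_in{$para} };
        ( $best, $best_gain ) = ( $para, $gain ) if $gain > $best_gain;
    }
    last unless defined $best;    # no paragraph adds anything new
    push @chosen, $best;
    delete @uncovered{ @{ $words_in{$best} } };
}

my @sorted = sort { $a <=> $b } @chosen;
print "GRE words appear in Para ", join( ", Para ", @sorted ), "\n";
# prints "GRE words appear in Para 1, Para 4"
```

      On this data the greedy pick happens to be optimal (Para 1, Para 4, matching the OP's desired output); in general greedy gives a logarithmic-factor approximation, not a guaranteed minimum.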


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        We can consider the words file to have each word on a separate line, and in the other file, containing paragraphs, each paragraph is on a single line. Each paragraph starts with a > symbol, so we can treat each occurrence of > as the start of a new para. I already have two hashes storing this information: one recording which words are contained in each paragraph, and another recording which paragraphs contain each word. The only remaining challenge is to find the minimum set of paras that includes all the keywords. Help is greatly appreciated.
Re: Help required in optimizing the output (PERL)
by MidLifeXis (Monsignor) on Jun 22, 2010 at 15:37 UTC

    Depending on how many paragraphs you have, brute force may work:

    • Find all keywords in your document and record the paragraph (perhaps a hash of arrays (HoA) would work for this)
    • For each combination of paragraphs, determine if you have all keywords
    • If you have all keywords, record the paragraphs necessary
    • If the number of paragraphs necessary is less than the current minimum, set the current minimum set to the current set of paragraphs.
    • ????
    • profit

    If the number of paragraphs is very large, the number of combinations may exceed your processing capabilities. Like I said, brute force :-)
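    The brute-force steps above can be sketched with a bitmask over the paragraphs, using the OP's four-paragraph example data (feasible only for small P, as noted, since there are 2^P subsets):

```perl
use strict;
use warnings;

# Exhaustive subset search: try every combination of paragraphs and
# keep the smallest one covering all keywords. Data is illustrative.
my %words_in = (
    1 => [qw(aberrant abscond advocate)],
    2 => [qw(abscond aberrant)],
    3 => [qw(aberrant)],
    4 => [qw(aggrandize)],
);
my @paras   = sort { $a <=> $b } keys %words_in;
my %all     = map { $_ => 1 } map { @$_ } values %words_in;
my $n_words = keys %all;

my @best;
SUBSET:
for my $mask ( 1 .. 2**@paras - 1 ) {            # every non-empty subset
    my @subset = grep { $mask & ( 1 << $_ ) } 0 .. $#paras;
    my %seen;
    $seen{$_}++ for map { @{ $words_in{ $paras[$_] } } } @subset;
    next SUBSET unless keys %seen == $n_words;   # must cover all keywords
    @best = map { $paras[$_] } @subset
        if !@best || @subset < @best;            # keep the smallest cover
}
print "Minimum set: Para ", join( ", Para ", @best ), "\n";
# prints "Minimum set: Para 1, Para 4"
```

    Unlike the greedy heuristic, this is guaranteed minimum, but the O(2^P) loop rules it out for the large paragraph counts mentioned later in the thread.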

    Alternate solution

    Does this reduce to a form of the spanning tree problem? Perhaps if you arrange a set of bipartite graphs with two types of nodes (paragraphs, P, and words, W), where each W is adjacent only to the Ps containing it, and each P is adjacent only to the Ws it contains, then you could:

    • for each graph
    • find the minimum depth to touch all W nodes (spanning tree)

    The sum of those depths (and the associated P nodes on those paths) should be your ideal list of paragraphs.

    I think (still have not had coffee this AM).

    Update: Given the updated information on the number of words and paragraphs, spanning tree may not be the best solution. Perhaps I will play with this and see if I can find the time to drop what is in my head down on paper.

    Update 2: This algorithm is not suitable for large quantities of P from a big-O standpoint. If I have it right, it comes down to something along the lines of O(W * 2^P). The first pass on each connected graph would need to try all combinations of P, with a maximum depth of W. Perhaps the upper bound can be reduced, but not at my current coffee levels.

    P.S. I have had problems this complex for homework. Perhaps not 100 level homework, but these problems do exist in upper level courses, and there are people who try to fake their way even at that level. So please, excuse some cynicism and a lack of desire from some to provide a full solution.

    --MidLifeXis

      I have a different idea, not fully tested and elaborated yet. For each paragraph, hash all the keywords contained in it. Then sort the paragraphs according to the number of keywords they contain. For each paragraph starting from the "richest" one, check whether all its keywords are present in the remaining paragraphs. If yes, remove the paragraph.
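      A sketch of that elimination idea, again on the OP's example data (note that, like the greedy approach, this yields an irredundant cover, i.e. no kept paragraph is removable, but not necessarily a minimum one):

```perl
use strict;
use warnings;

# Elimination heuristic: visit paragraphs richest-first and drop any
# whose keywords all still occur in the remaining paragraphs.
# Illustrative data only.
my %words_in = (
    1 => [qw(aberrant abscond advocate)],
    2 => [qw(abscond aberrant)],
    3 => [qw(aberrant)],
    4 => [qw(aggrandize)],
);

my %kept = %words_in;
for my $para ( sort { @{ $words_in{$b} } <=> @{ $words_in{$a} } }
               keys %words_in ) {
    # Count every keyword occurring in the OTHER kept paragraphs.
    my %elsewhere;
    $elsewhere{$_}++
        for map { @{ $kept{$_} } } grep { $_ != $para } keys %kept;
    # Drop this paragraph unless it contributes a unique keyword.
    delete $kept{$para}
        unless grep { !$elsewhere{$_} } @{ $words_in{$para} };
}
my @kept_sorted = sort { $a <=> $b } keys %kept;
print "Kept: Para ", join( ", Para ", @kept_sorted ), "\n";
# prints "Kept: Para 1, Para 4"
```

      With 600,000 paragraphs the inner recount of "elsewhere" words would need to be replaced by incremental per-word occurrence counts to be practical.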
        This may not completely work, as there are about 600,000 paragraphs, and trying to eliminate them that way may not be very efficient (I haven't tested it). There are about 200 keywords (these are the maximum counts we are considering). I will look at whether brute force or a spanning tree can be used. Thanks all for the help. I understand the homework concern; however, my comment was only about the part where somebody, without knowing the situation, made some unnecessary comments.
Re: Help required in optimizing the output (PERL)
by Yary (Pilgrim) on Jun 22, 2010 at 00:42 UTC
    Hello,

    First, a couple of meta-comments. Perl is not spelled in all upper case; it is either "Perl" (the language) or "perl" (a program/an interpreter). (That would be a point missed on the Perl GRE!)

    Second, this should be posted in "Seekers of Perl Wisdom", not "Cool Uses for Perl". I imagine a monk with the appropriate responsibilities will move this thread soon. update: already moved!

    Will post an answer to the actual question in a little while, if no one beats me to it.

      Please don't post an answer. This post screams "please do my homework for me." And shows no effort whatsoever.

      Updated.
