stupidstudent has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys, I am working on a project that needs to identify a list of keywords in a large text and extract, say, the 10 words before and after each keyword (including the keyword itself). I am very new to Perl and not sure how to do this. I apologize if this question is very elementary. Thank you very much in advance for any help! Best regards

Re: Extracting text around Key Words
by Athanasius (Archbishop) on Aug 18, 2015 at 03:41 UTC

    Hello stupidstudent, and welcome to the Monastery!

    Whether the assignment comes from your teacher, your boss, or your client, the first thing you need to do is to nail down the requirements:

    • What is a “word”? Does it include punctuation characters? Is “123” a word? Can a single “word” extend across line endings (via hyphenation)? — or would that be two words, with the hyphen stripped out?
    • How big is a “large text” file? Is it so large that the words won’t all fit into memory at once?
    • Edge cases
      • I guess we can assume that if a keyword occurs, say, 5 words into the file, then we’ll be satisfied to output just the 4 words before it? (Or would that render the sequence invalid?)
      • Likewise, for a keyword that occurs near the end of the file?
      • What happens if two keywords appear with fewer than 20 other words between them? Do we want to output two overlapping 21-word sequences, or one longer sequence containing both keywords?
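    To make those questions concrete, here is a minimal in-memory sketch that picks one possible answer to each. Everything in it is an assumption rather than part of the original question: the helper name keyword_contexts is hypothetical, a "word" is taken to be any run of non-whitespace characters, matching ignores case and surrounding punctuation, windows are simply cut short at the start and end of the text, and overlapping windows are merged into one longer sequence.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one string of context per keyword hit.
# Assumptions (not from the original post): a "word" is a run of
# non-whitespace characters; matching ignores case and surrounding
# punctuation; overlapping windows are merged into one longer sequence.
sub keyword_contexts {
    my ($text, $keywords, $n) = @_;
    my %is_key = map { lc($_) => 1 } @$keywords;
    my @words  = split ' ', $text;

    my @ranges;    # [start, end] index pairs, in ascending order
    for my $i (0 .. $#words) {
        # Compare a lower-cased copy with edge punctuation stripped.
        (my $bare = lc $words[$i]) =~ s/^\W+|\W+$//g;
        next unless $is_key{$bare};

        # Clamp the window at the start and end of the text.
        my $start = $i - $n < 0       ? 0       : $i - $n;
        my $end   = $i + $n > $#words ? $#words : $i + $n;

        if (@ranges && $start <= $ranges[-1][1]) {
            $ranges[-1][1] = $end;    # overlaps the previous window: extend it
        }
        else {
            push @ranges, [$start, $end];
        }
    }
    return map { join ' ', @words[ $_->[0] .. $_->[1] ] } @ranges;
}

print "$_\n" for keyword_contexts(
    'the quick brown fox jumps over the lazy dog', ['fox'], 2);
```

    With a window of 2, the example prints "quick brown fox jumps over". If the specification calls for separate overlapping sequences instead, the merge branch is the only part that would need to change.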

    Once you have a detailed specification for the project, the next step will be to design an algorithm. If the file really is very large, you might want to consider a sliding window approach.
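    One way a sliding window could look in Perl is sketched below. It is only an illustration under stated assumptions: scan_stream and its callback argument are hypothetical names, words are whitespace-separated, matching ignores case and surrounding punctuation, and overlapping hits are reported as separate sequences. Only the last N words, plus any hits still waiting for their trailing words, are held in memory at any time.

```perl
use strict;
use warnings;

# Hypothetical helper: streams words from a filehandle and calls $emit
# with each context, keeping at most $n previous words in memory.
# Assumptions: whitespace-separated words, case-insensitive match after
# stripping punctuation, overlapping hits reported separately.
sub scan_stream {
    my ($fh, $keywords, $n, $emit) = @_;
    my %is_key = map { lc($_) => 1 } @$keywords;
    my @before;     # the last $n words seen (the sliding window)
    my @pending;    # hits still waiting for their trailing words

    while (my $line = <$fh>) {
        for my $word (split ' ', $line) {
            # Feed this word to every hit that is still collecting.
            for my $hit (@pending) {
                push @{ $hit->{words} }, $word;
                $hit->{need}--;
            }

            # A keyword opens a new hit seeded with the preceding words.
            (my $bare = lc $word) =~ s/^\W+|\W+$//g;
            push @pending, { words => [ @before, $word ], need => $n }
                if $is_key{$bare};

            # Oldest hits complete first, so emit from the front.
            while (@pending && $pending[0]{need} <= 0) {
                $emit->(join ' ', @{ (shift @pending)->{words} });
            }

            # Slide the window forward by one word.
            push @before, $word;
            shift @before if @before > $n;
        }
    }
    # End of input: emit whatever is left, even if cut short.
    $emit->(join ' ', @{ $_->{words} }) for @pending;
}

# Demo on an in-memory filehandle; a real run would open the text file.
open my $fh, '<', \ "one two three KEY four five\nsix KEY seven" or die $!;
scan_stream($fh, ['key'], 2, sub { print "$_[0]\n" });
```

    Because words are counted across line breaks, the second hit in the demo picks up "five" from the previous line, and it is emitted with only one trailing word because the input ends there.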

    When you have an algorithm (say, in pseudocode), you can post it here for feedback. Then (and assuming you also have some input data files of different sizes suitable for testing), you can begin to implement your algorithm in Perl code. If you run into trouble during that task, post your code here and the monks will help you fix or improve it.

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Extracting texts around Key Words
by Anonymous Monk on Aug 18, 2015 at 03:31 UTC