Hello stupidstudent, and welcome to the Monastery!
Whether the assignment comes from your teacher, your boss, or your client, the first thing you need to do is to nail down the requirements:
- What is a “word”? Does it include punctuation characters? Is “123” a word? Can a single “word” extend across line endings (via hyphenation)? — or would that be two words, with the hyphen stripped out?
- How big is a “large text” file? Is it so large that the words won’t all fit into memory at once?
- Edge cases:
  - I guess we can assume that if a keyword occurs, say, 5 words into the file, then we’ll be satisfied to output just the 4 words before it? (Or would that render the sequence invalid?)
  - Likewise, for a keyword that occurs near the end of the file?
  - What happens if two keywords appear with fewer than 20 other words between them? Do we want to output two overlapping 21-word sequences, or one longer sequence containing both keywords?
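Once you have a detailed specification for the project, the next step will be to design an algorithm. If the file really is very large, you might want to consider a sliding window approach: read the words in order and keep in memory only the most recent few, enough to supply the context on either side of a keyword.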
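To make the sliding window idea concrete, here is a minimal sketch, not a finished solution. It rests on assumptions that your real specification may contradict: a “word” is any whitespace-delimited token, keywords match case-insensitively after leading and trailing punctuation is stripped, a truncated context at the start or end of the file is acceptable, and keywords that sit close together produce separate, overlapping sequences. The name `keyword_contexts` and its parameters are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one context string per keyword hit.
#   $fh       - filehandle to read words from
#   $keywords - hash ref whose keys are lowercase keywords
#   $n        - words of context wanted on each side (assumed default: 10)
sub keyword_contexts {
    my ($fh, $keywords, $n) = @_;
    $n //= 10;

    my @before;     # sliding window: the last $n words seen so far
    my @pending;    # hits still waiting for their trailing words
    my @results;

    while (my $line = <$fh>) {
        for my $word (split ' ', $line) {
            # Give the new word to every hit that is still collecting.
            for my $hit (@pending) {
                push @{ $hit->{words} }, $word;
                $hit->{need}--;
            }

            # Assumption: compare in lowercase, ignoring edge punctuation.
            (my $key = lc $word) =~ s/^\W+|\W+$//g;
            if ($keywords->{$key}) {
                push @pending, { words => [ @before, $word ], need => $n };
            }

            # The oldest pending hit always finishes first.
            while (@pending && $pending[0]{need} <= 0) {
                my $done = shift @pending;
                push @results, join ' ', @{ $done->{words} };
            }

            # Slide the window forward by one word.
            push @before, $word;
            shift @before if @before > $n;
        }
    }

    # End of file: emit hits whose trailing context was cut short.
    push @results, join ' ', @{ $_->{words} } for @pending;
    return @results;
}
```

Memory use depends on `$n` and on how many hits are pending at once, not on the size of the file, which is the point of the approach. Merging overlapping sequences into one longer sequence would need a different rule at the point where a new hit is pushed onto `@pending`.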
When you have an algorithm (say, in pseudocode), you can post it here for feedback. Then, assuming you also have some input data files of different sizes suitable for testing, you can begin to implement your algorithm in Perl code. If you run into trouble during that task, post your code here and the monks will help you fix or improve it.
Hope that helps,