swampyankee++ for noticing the problem with abbreviations. Short of the ability to parse and comprehend grammar, it's going to be very difficult to separate
"We sold the division to MegaTech, Ltd. in Asia last week, who flipped the sale to someone else."
from
"We sold the division to MegaTech Industries. In Asia last week, they flipped the sale to someone else."
other than by relying on the convention that a new sentence starts with an upper-case letter. Even that fails when the word following the abbreviation is a proper noun, in which case it's going to be a very hard nut to crack.
If, however, you only care about the "typical" case (because this is going to be a one-shot tool), you could:
- Split the text on /[.]\s+(?=[A-Z])/ to get sentences. (The lookahead keeps the split from eating the first letter of the next sentence.)
- Grep the sentences for /[aA]sia/, or for /\bAsia\b/ if you don't want the word "Asian" to count.
- Split the sentences that pass on ' ' to get words.
- Use the words you get from that split as keys to a hash, and increment a count in each bin.
Q.E.D.
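For what it's worth, the steps above can be sketched roughly as follows. This is a minimal Python sketch, not a tested tool: the function name `count_words_near_asia` and the sample text are made up for illustration, and it assumes the "typical" case only, so it uses a lookahead in the split and a whole-word match for "Asia". The regexes carry over to other regex flavours more or less unchanged.

```python
import re
from collections import Counter

# Sentence boundary: a period, whitespace, then a capital letter.
# The lookahead leaves the capital attached to the next sentence.
SENTENCE_BREAK = re.compile(r"[.]\s+(?=[A-Z])")

# Whole-word match, so "Asian" does not count.
ASIA = re.compile(r"\bAsia\b")


def count_words_near_asia(text):
    """Count the words of every sentence that mentions Asia.

    Hypothetical helper for illustration; the name is not from the thread.
    """
    counts = Counter()
    for sentence in SENTENCE_BREAK.split(text):
        if ASIA.search(sentence):
            # Split on whitespace; punctuation stays attached to the words.
            counts.update(sentence.split())
    return counts


# Made-up sample text, based on the example sentences above.
sample = (
    "We sold the division to MegaTech Industries. "
    "In Asia last week, they flipped the sale to someone else. "
    "Nobody in Europe noticed."
)
counts = count_words_near_asia(sample)
for word, n in counts.most_common():
    print(word, n)
```

Note that punctuation is not stripped, so "week," and "week" would land in different bins, and a sentence ending in an abbreviation followed by a proper noun ("... MegaTech, Ltd. Asia is ...") will still be split in the wrong place.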