Re: Counting words in headlines

For an initial attempt, I would go with the following:

Read the file in "paragraph" mode. Detect headlines by locating "paragraphs" that consist of only a single line of text.
Store a count for each word found in the headline lines in a hash. Note: Force lowercase as a canonical representation.
Look up each country in the hash to find its count. Note: Force lowercase here too, so the keys match. See above.

Example:

# Maintain a word count for words found in header lines.
my %header_words;

# Read text in paragraph mode.
$/ = '';

# Read one paragraph at a time.
while (<NEWS>) {

    # Only consider paragraphs that contain a single line of text.
    if (/\A\s*\S[^\r\n]*\s*\z/) {
        $header_words{lc $_}++ for /(\w+)/g;
    }
}

# For each country, obtain the word count.
for my $country (@countries) {
    my $count = $header_words{lc $country} || 0;
    print "$count $country\n";
}
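
To try the same idea without the original news file, here is a minimal self-contained sketch. The sample text, the country list, and the helper name count_headline_words are all made up for illustration; it reads from an in-memory filehandle instead of NEWS, and it assumes that a headline is any paragraph consisting of a single line.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash ref of lowercase word counts
# taken only from single-line paragraphs ("headlines") in $text.
sub count_headline_words {
    my ($text) = @_;
    my %header_words;

    # Read from an in-memory filehandle so no external file is needed.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    # Paragraph mode, restricted to this subroutine.
    local $/ = '';

    while (my $para = <$fh>) {
        # Skip paragraphs that span more than one line of text.
        next unless $para =~ /\A\s*\S[^\r\n]*\s*\z/;
        $header_words{lc $_}++ for $para =~ /(\w+)/g;
    }
    close $fh;
    return \%header_words;
}

# Made-up sample input: two headlines, each followed by a body paragraph.
my $news = <<'END';
France wins cup

Paris was jubilant tonight
as fans filled the streets.

Talks between France and Germany stall

Negotiators left without
a deal.
END

my @countries = qw(France Germany Spain);
my $counts    = count_headline_words($news);

# Prints "2 France", "1 Germany" and "0 Spain", one per line.
for my $country (@countries) {
    my $count = $counts->{lc $country} || 0;
    print "$count $country\n";
}
```

Note that this only counts single-word names; a country such as "New Zealand" would need a different lookup, since the hash is keyed on individual words.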


Replies are listed 'Best First'.
Re: Re: Counting words in headlines by mooseboy (Pilgrim) on Feb 04, 2003 at 11:02 UTC
Thanks, seems to work nicely! PS: trailing slash missing from `(/\A\s[^\r\n]+\s\z)`