in reply to Counting words in headlines

For an initial attempt, I would go with the following:

  1. Read the file in "paragraph" mode. Detect headlines by locating "paragraphs" that contain only a single line of text.
  2. Keep a count of each word found in headlines in a hash. Note: Force lowercase as a canonical representation.
  3. Look up each country in the hash to find its count. Note: Force lowercase here too, so the keys match.

Example:

  # Maintain a count of the words found in header lines.
  my %header_words;

  # Read text in paragraph mode.
  $/ = '';

  # Read one paragraph at a time.
  while (<NEWS>) {
      # Only consider paragraphs that contain a single line of text.
      if (/\A\s*\S[^\r\n]*\s*\z/) {
          $header_words{lc $_}++ for /(\w+)/g;
      }
  }

  # For each country, obtain the word count.
  for my $country (@countries) {
      my $count = $header_words{lc $country} || 0;
      print "$count $country\n";
  }

Re: Re: Counting words in headlines
by mooseboy (Pilgrim) on Feb 04, 2003 at 11:02 UTC

    Thanks, seems to work nicely!

    PS: trailing slash missing from (/\A\s*[^\r\n]+\s*\z)