Counting words in headlines

mooseboy has asked for the wisdom of the Perl Monks concerning the following question:

Good morning monks,
Hope the regex experts out there can help out on this one: I have a large file of international news stories and want to count how many stories there are from each country (doesn't have to be exact). The file is formatted like so:

Headline of story is here

Text of story is here 
Text of story is here 
Text of story is here 

Headline of story is here

Text of story is here 
Text of story is here 
Text of story is here

Eyeballing the file shows that most headlines do in fact have the country name in them, so it seems like OWTDI would be just to count the occurrences of country names in the headlines only, ignoring the text. How can I modify the regex in the following loop to do that?

while (<NEWS>) {
    foreach my $country (@countries) {
        $story_count{$country}++ if m/$country/gi;
    }
}

Thanks in advance, mooseboy

Re: Counting words in headlines by MarkM (Curate) on Feb 04, 2003 at 09:21 UTC
For an initial attempt, I would run with the following:

1. Read the file in "paragraph" mode.
2. Detect headlines by locating "paragraphs" that have only a single line of text.
3. Store a word count for header lines in a hash. Note: force lowercase as a canonical representation.
4. Look up each country in the hash to find the count. Note: force lowercase here too, for the same reason.

Example:

# Maintain a word count for words found in header lines.
my %header_words;

# Read text in paragraph mode.
$/ = '';

# Read one paragraph at a time.
while (<NEWS>) {
    # Only consider paragraphs that contain a single line of text.
    if (/\A\s*\S[^\r\n]*\s*\z) {
        $header_words{lc $_}++ for /(\w+)/g;
    }
}

# For each country, obtain the word count.
for my $country (@countries) {
    my $count = $header_words{lc $country} || 0;
    print "$count $country\n";
}
Re: Re: Counting words in headlines by mooseboy (Pilgrim) on Feb 04, 2003 at 11:02 UTC
Thanks, seems to work nicely! PS: trailing slash missing from `(/\A\s*[^\r\n]+\s*\z)`
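
For anyone who wants to try this without a news file to hand, here is a minimal self-contained sketch of the paragraph-mode approach with the trailing slash restored. The sub name `count_headline_countries`, the sample headlines, and the country list are made up for illustration and are not from the posts above. It reads from an in-memory filehandle instead of NEWS.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each country name appears as a
# word in single-line paragraphs (assumed to be the headlines).
sub count_headline_countries {
    my ($fh, @countries) = @_;

    my %header_words;

    # Paragraph mode: records are separated by one or more blank lines.
    local $/ = '';

    while (my $para = <$fh>) {
        # A headline is a paragraph holding exactly one line of text.
        next unless $para =~ /\A\s*\S[^\r\n]*\s*\z/;

        # Lowercase every word so that lookups are case-insensitive.
        $header_words{lc $1}++ while $para =~ /(\w+)/g;
    }

    my %story_count;
    $story_count{$_} = $header_words{lc $_} || 0 for @countries;
    return %story_count;
}

# Made-up sample data in the layout described in the question.
my $sample = <<'END';
Floods hit France

Text of story is here
Text of story is here

Japan and France sign trade deal

Text of story mentions Japan
Text of story is here
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my %story_count = count_headline_countries($fh, qw(France Japan Peru));
close $fh;

print "$story_count{$_} $_\n" for sort keys %story_count;
```

With this sample the "Japan" in the story text is ignored, because that paragraph has more than one line. Two caveats: the hash is keyed on single words, so a multi-word name such as "United States" would never match as written, and a story whose text happens to be one line long would be counted as a headline.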