mooseboy has asked for the wisdom of the Perl Monks concerning the following question:

Good morning monks,
Hope the regex experts out there can help out on this one: I have a large file of international news stories and want to count how many stories there are from each country (doesn't have to be exact). The file is formatted like so:

Headline of story is here

Text of story is here
Text of story is here
Text of story is here

Headline of story is here

Text of story is here
Text of story is here
Text of story is here

Eyeballing the file shows that most headlines do in fact have the country name in them, so it seems like OWTDI would be just to count the occurrences of country names in the headlines only, ignoring the text. How can I modify the regex in the following loop to do that?

while (<NEWS>) {
    foreach my $country (@countries) {
        $story_count{$country}++ if m/$country/gi;
    }
}

Thanks in advance, mooseboy

Re: Counting words in headlines
by MarkM (Curate) on Feb 04, 2003 at 09:21 UTC

    For an initial attempt, I would run with the following:

    1. Read the file in "paragraph" mode. Detect headlines by locating "paragraphs" that have only a single line of text.
    2. Store a word count for header lines into a hash. Note: Force lowercase as a canonical representation.
    3. Look up each country in the hash to find its count. Note: Force lowercase, as above.

    Example:

    # Maintain a word count for words found in header lines.
    my %header_words;

    # Read text in paragraph mode.
    $/ = '';

    # Read one paragraph at a time.
    while (<NEWS>) {
        # Only consider paragraphs that contain a single line of text.
        if (/\A\s*\S[^\r\n]*\s*\z/) {
            $header_words{lc $_}++ for /(\w+)/g;
        }
    }

    # For each country, obtain the word count.
    for my $country (@countries) {
        my $count = $header_words{lc $country} || 0;
        print "$count $country\n";
    }

      Thanks, seems to work nicely!

      PS: trailing slash missing from (/\A\s*[^\r\n]+\s*\z)
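
For anyone who wants to try the paragraph-mode approach from MarkM's reply without a real news file, here is a minimal self-contained sketch. The sample stories, the @countries list, the in-memory filehandle, and the helper name count_headline_words are all assumptions made up for illustration; none of them come from the thread. It also assumes, as the reply does, that every headline is a single-line paragraph set off by blank lines.

```perl
use strict;
use warnings;

# Hypothetical sample data: each headline is a one-line paragraph,
# each story body spans several lines, blank lines separate paragraphs.
my $sample = <<'END';
Floods hit France

Text of story is here
Text of story is here

Elections in Japan

Text of story mentioning France is here
Text of story is here

France wins final

Text of story is here
Text of story is here
END

# Assumed country list, for illustration only.
my @countries = qw(France Japan Brazil);

# Return a hashref of lowercased word counts taken from headlines only.
sub count_headline_words {
    my ($fh) = @_;
    my %header_words;
    local $/ = '';    # paragraph mode: records end at blank lines
    while (my $para = <$fh>) {
        # Skip paragraphs that contain more than one line of text.
        next unless $para =~ /\A\s*\S[^\r\n]*\s*\z/;
        $header_words{lc $_}++ for $para =~ /(\w+)/g;
    }
    return \%header_words;
}

# Read the sample through an in-memory filehandle.
open my $news, '<', \$sample or die "Cannot open in-memory file: $!";
my $header_words = count_headline_words($news);
close $news;

for my $country (@countries) {
    my $count = $header_words->{lc $country} || 0;
    print "$count $country\n";
}
```

With this sample the mention of France inside a story body is ignored, since only single-line paragraphs are counted. Because the hash is keyed on single words, a multi-word name such as "United States" would never match and would need separate handling.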