Isanchez has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

If you have running text such as:

Douglas built five Douglas World Cruisers to attempt his first flight to Buenos Aires. These were the predecesors of the modern AH-64D and AH-64D Apache.

... and you need to grab the following clusters of capitalized words:

Douglas

Douglas World Cruisers

Buenos Aires

AH-64D

AH-64D Apache

What would be a good way of grabbing the largest cluster of capitalized words?

Thank you,

Isanchez

Replies are listed 'Best First'.
Re: Capitalization Clusters
by dws (Chancellor) on Aug 06, 2003 at 00:41 UTC
    What would be a good way of grabbing the largest cluster of capitalized words?

    Here's a different strategy, based on using split() to separate a string into pieces using non-capitalized words as separators, then discarding empty parts. This scheme needs some tweaking to honor sentence boundaries, but the general trick is a useful one in situations like this.

    #!/usr/bin/perl -w
    use strict;

    my $source = join('', <DATA>);
    my @capgroups = grep { $_ } split(/(?:^|\s+)(?:[^A-Z]\S*\s*)+/, $source);
    foreach ( @capgroups ) {
        print "$_\n";
    }
    __DATA__
    Douglas built five Douglas World Cruisers to attempt his first flight to Buenos Aires. These were the predecesors of the modern AH-64D and AH-64D Apache.
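    The "tweaking to honor sentence boundaries" mentioned above could look roughly like the following. This is only a sketch, not dws's code: the helper name cap_groups and the naive sentence splitter (break after ., ! or ? followed by whitespace) are assumptions, and abbreviations such as "Mr. Smith" would be split incorrectly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): run the split() trick on one
# sentence at a time so that no cluster can span a sentence boundary.
sub cap_groups {
    my ($text) = @_;
    my @groups;

    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for my $sentence ( split /(?<=[.!?])\s+/, $text ) {

        # Drop the sentence-final punctuation so it does not stick to a word.
        $sentence =~ s/[.!?]+\s*$//;

        # Same separator as in the reply above: runs of non-capitalized words.
        push @groups,
            grep { length $_ }
            split /(?:^|\s+)(?:[^A-Z]\S*\s*)+/, $sentence;
    }
    return @groups;
}

my $sample = 'Douglas built five Douglas World Cruisers to attempt his first '
           . 'flight to Buenos Aires. These were the predecesors of the '
           . 'modern AH-64D and AH-64D Apache.';

print "$_\n" for cap_groups($sample);
```

    With this change "Buenos Aires" and "These" come out as separate groups. Sentence-initial words such as "These" are still reported, so that question remains open.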
Re: Capitalization Clusters
by traveler (Parson) on Aug 05, 2003 at 22:52 UTC
    Is "Buenos Aires. These" a "cluster of capitalized words"? If not, is the period the issue, or the end of the line?

      And how does the script differentiate between "These" and "Douglas" at the beginning of a sentence? One is a proper noun and the other is capitalized only because it starts the sentence, but there is no clear way to tell them apart.

      </ajdelore>
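      One possible heuristic for this ambiguity, offered only as a sketch: trust a sentence-initial capitalized word only if the same word also appears capitalized somewhere mid-sentence. The function name clusters_with_evidence and the word definition [A-Z][\w-]* are assumptions, and the heuristic has an obvious weakness: a proper noun that only ever occurs at the start of a sentence is dropped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed definition of a capitalized "word" (letters, digits, _, hyphen).
my $cap_word = qr/[A-Z][\w-]*/;

# Hypothetical heuristic: a sentence-initial capitalized word is kept only
# if the same word is also seen capitalized in the middle of a sentence.
sub clusters_with_evidence {
    my ($text) = @_;

    # Collect capitalized words preceded by a word character (or , ; :)
    # and one whitespace character, i.e. words that are not sentence-initial.
    my %seen_mid;
    $seen_mid{$1}++ while $text =~ /(?<=[\w,;:]\s)($cap_word)/g;

    my @clusters;
    while ( $text =~ /\b($cap_word(?:\s+$cap_word)*)/g ) {
        my $cluster = $1;
        my $before  = substr( $text, 0, $-[1] );

        # Sentence start: nothing before, or sentence punctuation plus spaces.
        if ( $before =~ /(?:\A|[.!?]\s+)\z/ ) {
            my ( $first, $rest ) = split /\s+/, $cluster, 2;
            if ( !$seen_mid{$first} ) {
                # No mid-sentence evidence: treat the first word as ordinary
                # sentence capitalization and keep only what follows it.
                next unless defined $rest;
                $cluster = $rest;
            }
        }
        push @clusters, $cluster;
    }
    return @clusters;
}

my $sample = 'Douglas built five Douglas World Cruisers to attempt his first '
           . 'flight to Buenos Aires. These were the predecesors of the '
           . 'modern AH-64D and AH-64D Apache.';

print "$_\n" for clusters_with_evidence($sample);
```

      On the sample text this keeps the leading "Douglas" (it reappears in "five Douglas World Cruisers") and drops "These". A stop list of common sentence openers would be a simpler alternative with different failure modes.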

Re: Capitalization Clusters
by LameNerd (Hermit) on Aug 05, 2003 at 23:03 UTC
    I think this is almost what you want ...

    #!/usr/bin/perl -w
    use strict;

    $_ = <DATA>;
    my @matches = /([A-Z].*? )[a-z].*?/g;
    my $biggest = 0;
    my $biggestCluster;
    for my $m ( @matches ) {
        print "$m\n";
        $_ = $m;
        my @cs = /([A-Z].*? )/g;
        if ( $#cs > $biggest ) {
            $biggestCluster = $m;
            $biggest = $#cs;
        }
    }
    print "Biggest Cluster is [$biggestCluster]\n";
    __DATA__
    Douglas built five Douglas World Cruisers to attempt his first flight to Buenos Aires. There were the predecessor of the modern AH-64D and AH-64D Apache.

    ... it has a problem picking up AH-64D Apache, but hopefully you will still find this somewhat helpful.
Re: Capitalization Clusters
by derby (Abbot) on Aug 05, 2003 at 23:11 UTC
    Well ... given all the good comments above, one brute force ugly way is:

    #!/usr/bin/perl

    $sentence = "Douglas built five Douglas World Cruisers to attempt his first flight to Buenos Aires. These were the predecesors of the modern AH-64D and AH-64D Apache.";

    # split on spaces and some common punctuations
    @words = split( /\s+|[.,\/\\;]/, $sentence );

    # roll through the words and grab the cap words
    # and their position
    $cnt = 0;
    foreach( @words ) {
        push( @lol, [ $_, $cnt ] ) if /^[A-Z]/;
        $cnt++;
    }

    # print out the first match and remember its position
    # (not position - 1, or an adjacent second cap word would
    # wrongly start a new line)
    $prev = $lol[0]->[1];
    print $lol[0]->[0];
    shift( @lol );

    # roll through the rest. print out a new line
    # if it's not the next word by count, space if it is
    foreach( @lol ) {
        print $prev != $_->[1]-1 ? "\n" : " ";
        print $_->[0];
        $prev = $_->[1];
    }
    print "\n";

    Like I said ... ugly.

    -derby

Re: Capitalization Clusters
by Abigail-II (Bishop) on Aug 05, 2003 at 22:20 UTC
    That depends. It seems that AH-64D is a word, even though it contains digits and a hyphen. So you first have to define what a word is. Is "don't" a word? Two words? Is 64D a word? Is it capitalized?

    Abigail
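
    To make the point concrete, here is a sketch (not from the thread) that runs the same cluster extraction with three different definitions of a capitalized word. The helper clusters_for, the three patterns, and the test sentence are all illustrative assumptions; none of them is "the" right definition.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: extract clusters for a caller-supplied definition
# of "capitalized word". The lookarounds stop a match from starting or
# ending in the middle of a token such as "Don't" or "AH-64D".
sub clusters_for {
    my ( $word_re, $text ) = @_;
    my @found;
    push @found, $1
        while $text =~ /(?<![\w'-])($word_re(?:\s+$word_re)*)(?![\w'-])/g;
    return @found;
}

# Three candidate definitions (all assumptions, pick what fits your data).
my @definitions = (
    [ 'letters only'    => qr/[A-Z][a-z]+/ ],
    [ 'with hyphen'     => qr/[A-Z][\w-]*/ ],
    [ 'with apostrophe' => qr/[A-Z][\w'-]*/ ],
);

my $example = "Don't confuse AH-64D Apache with Douglas World Cruisers.";

for my $def (@definitions) {
    my ( $name, $re ) = @$def;
    print "$name: ", join( ' | ', clusters_for( $re, $example ) ), "\n";
}
```

    Under the first definition "AH-64D" is not a word at all, so "Apache" stands alone; under the second, "AH-64D Apache" is one cluster; only the third treats "Don't" as a capitalized word.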

Re: Capitalization Clusters
by bbfu (Curate) on Aug 06, 2003 at 18:59 UTC

    A slightly more straightforward approach (IMO). It picks up 'These,' unless you uncomment the first line in the regexp (but then you run the risk of missing proper names at the beginning of sentences; you still need to decide how you want to handle that).

    #!/usr/bin/perl -l
    use warnings;
    use strict;

    my $data = 'Douglas built five Douglas World Cruisers to attempt his first flight to Buenos Aires. These were the predecesors of the modern AH-64D and AH-64D Apache.';

    # What do you call a capitalized "word"?
    my $cap_word = qr/[A-Z][\w-]*/;

    my @clusters = $data =~ /
        #(?<!\.\s)              # Ignore words at beginning of sentences?
        (
          $cap_word             # Capitalized word, followed by any number
          (?:\s+$cap_word)*     # of other cap words (separated by spaces)
        )
    /gx;

    # Update: Oh, you wanted the largest...
    print "Largest cluster: ", (sort { length $b <=> length $a } @clusters)[0];

    bbfu
    Black flowers blossom
    Fearless on my breath
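
    The one-liner above ranks clusters by character length. If "largest" is instead taken to mean "most words", which is only one reading of the question, a sketch along these lines could be used. The helper names word_count and largest_cluster are hypothetical; ties in word count are broken by character length.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: number of whitespace-separated words in a cluster.
sub word_count {
    my @words = split ' ', $_[0];
    return scalar @words;
}

# Hypothetical helper: the cluster with the most words; ties are broken
# by character length. Returns undef for an empty list.
sub largest_cluster {
    my @sorted = sort {
        word_count($b) <=> word_count($a)
            or length($b) <=> length($a)
    } @_;
    return $sorted[0];
}

my @found = (
    'Douglas', 'Douglas World Cruisers', 'Buenos Aires',
    'These', 'AH-64D', 'AH-64D Apache',
);

print "Largest cluster: ", largest_cluster(@found), "\n";
```

    The two measures agree on the sample text, but they can differ: a single long word outranks "Buenos Aires" by character length and loses to it by word count.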