in reply to Re: Re: Out of memory
in thread Out of memory

If your input data is such that just finding two or more Consecutive Capitalized Words is sufficient to identify proper names, then your logic can be very "local" as well as simple -- you wouldn't need more than two lines of text in memory at any one time with Perl. But avoiding false alarms and misses takes some careful thought. Aside from the two false alarms in my first two sentences, your current method would miss the next case completely. (Mike O'Conner, whoever he is, would be missed if he's mentioned at the beginning of a quotation or, as here, of a parenthetical. And what about George McGovern?)
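To make those cases concrete, here is a minimal sketch of a per-token test that tolerates leading punctuation, internal apostrophes and mixed case. The helper name `cap_word` is made up for illustration, and the pattern is only one assumption about what counts as "capitalized":

```perl
use strict;
use warnings;

# Hypothetical helper: return the capitalized word inside one
# whitespace-delimited token, or undef if the token has none.
sub cap_word {
    my ($token) = @_;
    # \W* skips an opening quote or parenthesis; the character class
    # allows an apostrophe and inner capitals (O'Conner, McGovern)
    return $token =~ /^\W*([A-Z]['A-Za-z]*)\b/ ? $1 : undef;
}

for my $token ( '(Mike', "O'Conner,", '"George', 'McGovern?', 'whoever' ) {
    my $word = cap_word($token);
    printf "%-12s => %s\n", $token, defined $word ? $word : '(not capitalized)';
}
```

Trailing punctuation is simply dropped here, so this test alone cannot tell where a sentence ends.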

Leaving that aside, this may be one of those rare cases where a "goto" statement is worth having. Consider:

    use strict;
    use warnings;

    my @tokens     = ();
    my $properName = "";

    while (<>) {
        push @tokens, split;

      FINDCAP:
        # the "until" loop skips non-capitalized tokens
        until ( @tokens == 0 or $tokens[0] =~ /^\W*[A-Z][\'A-Za-z]*\b/ ) {
            shift @tokens;
        }

        # the "while" loop accumulates consecutive capitalized tokens,
        # if any were found that caused us to break out of the "until"
        while ( @tokens and $tokens[0] =~ /^\W*([A-Z][\'A-Za-z]*)\b/ ) {
            $properName .= $1 . " ";
            shift @tokens;
        }

        # go into the next block if there are still tokens left
        # (this means we haven't reached the end of this line)
        if ( @tokens ) {
            if ( $properName =~ / [A-Z]/ ) {
                print $properName, $/;   # print if $properName has >1 word
            }
            $properName = "";            # reset to empty string
            goto FINDCAP;                # look through the remainder of this line
        }

        # that block was skipped if there are no tokens left
        # so we loop back to the outer "while" loop to get the
        # next line, and append its words to @tokens
        # (and $properName, if not empty, remains intact
        # for appending the next Capitalized Token, if any)
    }

    # at end of input, a name that ran up to the very last token is
    # still pending, so print it here if it has >1 word
    print $properName, $/ if $properName =~ / [A-Z]/;

Because it ignores punctuation, this will do the wrong thing in a case like "I saw John Brown. He was dead.", where it reports "John Brown He" as a single name. Good luck!

(update: enhanced the final block of commentary in the code, and made the title more relevant)
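One possible way around the punctuation problem, sketched under some loud assumptions: a token ending in ".", "!" or "?" (optionally followed by a closing quote or bracket) is taken to end the sentence, and the text is handed over as one string rather than streamed line by line. The function name `proper_names` is hypothetical, not something from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the runs of two or more consecutive
# capitalized words in $text, ending a run at sentence-final punctuation.
sub proper_names {
    my ($text) = @_;
    my ( @names, @run );

    # keep the current run only if it has at least two words, then reset it
    my $flush = sub {
        push @names, "@run" if @run > 1;
        @run = ();
    };

    for my $token ( split ' ', $text ) {
        if ( $token =~ /^\W*([A-Z]['A-Za-z]*)\b/ ) {
            push @run, $1;
            # assumption: trailing . ! or ? means the sentence stops here
            $flush->() if $token =~ /[.!?]["')\]]*$/;
        }
        else {
            $flush->();    # a non-capitalized token always ends the run
        }
    }
    $flush->();            # a run may still be pending at end of input
    return @names;
}

print "$_\n" for proper_names('I saw John Brown. He was dead.');
```

For that sentence it prints only "John Brown". The price is a new kind of miss: abbreviations such as "Mr." or initials such as "J." also end a run, so "Mr. Brown" would no longer be reported.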