Match non-capitalized words at the beginning of each sentence

WarrenBullockIII has asked for the wisdom of the Perl Monks concerning the following question:

Re: Regular Expressions by Zaxo (Archbishop) on Jun 09, 2002 at 02:58 UTC
There are lots of ways, not all equivalent. Unicode, for instance, makes the question a lot more complicated, as do locales. For plain declarative sentences in ASCII, like the ones you are parsing,

`while (<>) { if (/\.\s+([A-Z]\w*)/) { print "$1 is capitalized\n"; } }`

will pick up capitalized initial words for all but the first sentence and sentences starting after a `$/` [usu. linebreak]. The first sentence has no preceding period. That also ignores the possibility of text with dialogue, interrogation points, exclamation points, ellipsis, ...

There is a CPAN module, Lingua::EN::Sentence, which may be useful to you. The perl functions uc, lc, ucfirst, and lcfirst are handy for this kind of comparison. Accepting the sentence matching of your example for simplicity,

`if ( /\.\s+(\w+)/ and $1 eq ucfirst($1) ) { print "That'un's Ok", $/; }`

That re picks the first word after a period and whitespace. To solve the sentence-after-linebreak problem, you can either slurp the entire file with `local $/ = undef;` and m//s, or else match terminal periods and keep state in some variable for the next line.

After Compline,
Zaxo
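The second option above (match terminal periods and keep state for the next line) is not spelled out in the thread, so here is a minimal sketch of one way it might look. The sub name `uncapitalized_starts` and the `$at_start` flag are made up for illustration, and the sketch assumes plain ASCII text where `.`, `?` and `!` always end a sentence.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): scan lines one at a time and
# return the lowercase words that begin a sentence.
sub uncapitalized_starts {
    my @lines = @_;
    my @found;
    my $at_start = 1;    # true while the next word seen begins a sentence

    for my $line (@lines) {
        # The previous line ended a sentence (or this is the first line),
        # so the first word of this line starts a new one.
        if ($at_start and $line =~ /^\s*([a-z]\w*)/) {
            push @found, $1;
        }

        # Sentences that start inside the line.
        while ($line =~ /[.?!]\s+([a-z]\w*)/g) {
            push @found, $1;
        }

        # Remember whether this line ended on a terminator.
        # Blank lines leave the state untouched.
        if ($line =~ /\S/) {
            $at_start = ($line =~ /[.?!]\s*$/) ? 1 : 0;
        }
    }
    return @found;
}

my @demo = ("hello world. this is\n", "a test. And more.\n", "next one here.\n");
print "$_ is not capitalized\n" for uncapitalized_starts(@demo);
```

Unlike slurping, this never holds more than one line in memory, at the cost of carrying the flag between iterations.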
Re: Regular Expressions by %mick (Friar) on Jun 09, 2002 at 03:11 UTC
I see three cases to deal with: (1) the first sentence, which has no preceding punctuation-whitespace; (2) a "regular" sentence that has the punctuation-whitespace-word sequence; (3) a regular sentence that happens to start on a new line. Assuming the file is not so huge that we can't slurp it in, I'd take the easy way out of number three and undef the input record separator (`undef $/;`). The first two could be handled separately. (The code below actually uses paragraph mode, `$/ = ''`, so the first check runs once per paragraph rather than once per file.)

`#!/usr/bin/perl -w
use strict;
@ARGV = '/Perl/LearningPerl/Test';
local $/ = '';
for (<>) {
    if (/^([a-z]\w*)/) { print "'$1' is not capitalized\n"; }
    while (/[:.?!]\s+([a-z]\w*)/g) { print "'$1' is not capitalized\n"; }
}`
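Once the text is in a single string, the three cases above could arguably be folded into one pattern. This is only a sketch under that assumption: `lowercase_sentence_starts` is a hypothetical name, and it treats only `.`, `?` and `!` as terminators (the code above also accepts `:`).

```perl
use strict;
use warnings;

# Hypothetical helper: one pass over a slurped string.
sub lowercase_sentence_starts {
    my ($text) = @_;
    # \A\s*      covers the first sentence (nothing precedes it)
    # [.?!]\s+   covers every later sentence; \s also matches newlines,
    #            so a sentence that starts on a new line needs no special case
    my @words = $text =~ /(?:\A\s*|[.?!]\s+)([a-z]\w*)/g;
    return @words;
}

my $text = "hello world! how are you?\nwhoops, forgot. Fine. ok";
print "'$_' is not capitalized\n" for lowercase_sentence_starts($text);
```

To run it on a file, read the whole thing first, for example with `my $text = do { local $/; <> };`.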
(jeffa) Re: Match non-capitalized words at the beginning of each sentence (was: Regular Expressions) by jeffa (Bishop) on Jun 09, 2002 at 02:21 UTC
(updated node) Actually, ignore me and just read what Zaxo and %mick have to say.

Hmmmm, this will get all non-capitalized words at the beginning of each sentence except the very first sentence and store them into an array:

`my $s = 'hello world! how are you? whoops, forgot.'; my @no_caps = $s =~ /[.?!]\s+([a-z]\w*)/g;`

Better to replace them on the spot says me:

`$s =~ s/^([a-z]\w*)/ucfirst $1/e; $s =~ s/([.?!]\s+)([a-z]\w*)/$1 . ucfirst $2/eg;`

The first regex gets the first word of the string, the second takes care of the rest. Putting this back into your original code we get:

`use strict;
@ARGV = '/Perl/LearningPerl/Test';
while (<>) {
    if (/^([a-z]\w*)/) { print "$1 is not capitalized\n"; }
    while (/([.?!]\s+)([a-z]\w*)/g) { print "$2 is not capitalized\n"; }
}`

And that's ugly. The `if` checks the first word of each line (not just of the file), and the `while` loop takes care of the rest. And it is still broken, as newlines are the monkey wrench in this machine. Taking Zaxo's suggestion of slurping the entire file into a scalar will fix that (the error, not the ugliness):

`my $file = do { local $/; <> };
if ($file =~ /^([a-z]\w*)/) { print "$1 is not capitalized\n"; }
while ($file =~ /([.?!]\s+)([a-z]\w*)/g) { print "$2 is not capitalized\n"; }`

Sorry for being too quick to respond.

jeffa
I shoulda waited for merlyn ...
Re: Regular Expressions by erikharrison (Deacon) on Jun 09, 2002 at 02:41 UTC
Hmmmm... not sure how efficient this is, but this is how I'd do it.

`my $paragraph = shift;
my @sentences = split /[.!?]/, $paragraph;
foreach (@sentences) {
    next unless /\w/;                   # skip chunks that contain no words
    print "$_\n" unless /^\s*[A-Z]/;    # print sentences that don't start with a capital
}`

Cheers,
Erik

Light a man a fire, he's warm for a day. Catch a man on fire, and he's warm for the rest of his life. - Terry Pratchet*

Update: Fixed stupid bug.
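Splitting on the terminator itself throws the punctuation away. If you want to keep it, a lookbehind lets `split` break on the whitespace that follows a terminator instead. A rough sketch, with `uncapitalized_sentences` as a made-up name and the same plain-ASCII assumptions as the rest of the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: return the sentences that do not start with a capital.
sub uncapitalized_sentences {
    my ($paragraph) = @_;
    # Split on whitespace that follows ., ! or ? so each sentence
    # keeps its own terminator.
    my @sentences = split /(?<=[.!?])\s+/, $paragraph;
    # Keep chunks that contain words and do not begin with A-Z.
    return grep { /\w/ and not /^\s*[A-Z]/ } @sentences;
}

print "$_\n" for uncapitalized_sentences("It works. does it? yes! Good.");
```

Because it only splits where whitespace follows the terminator, something like www.goof.com stays in one piece.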
Re: Match non-capitalized words at the beginning of each sentence by ismail (Acolyte) on Jun 10, 2002 at 06:19 UTC
$/ is (almost) your friend:

`local $/ = ".";
open my $in, '<', './testfile' or die "Can't read ./testfile: $!";
open my $out, '>', './outputfile' or die "Can't write ./outputfile: $!";
while (<$in>) {
    s/^(\W*)(\w)/$1\u$2/;    # capitalize the first word character of each "."-terminated record
    print $out $_;
}
close $in;
close $out;`

And use this as testfile (make your own spelling errors ;):

`this is a test sentence. this is another one. we want to test wheter or not this will make these sentences uppercase. another paratraph. and another sentence. An upercase one. a differetn kind of break. thisone is kind of difficult: www.goof.com.`

Note that it doesn't handle the domain name very gracefully, nor will it handle any other period embedded in a sentence that isn't a sentence terminator. However, I can't really think of many: domain names and URLs (and some emails: frank.crist@whatever). decimal numbers. dotted lists, such as this one. ellipses (...)

It also doesn't take into account quoted text: He wondered, "is this in context? possibly."

So, possibly more clever than useful. If the input doesn't follow strict(ish) rules of grammar, you're hosed.
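One way around the embedded-period problem might be to leave `$/` alone and only treat a terminator as a sentence end when whitespace follows it. This is a sketch rather than a fix for everything listed above: `capitalize_sentences` is a hypothetical name, and ellipses and quoted text after a comma are still not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: capitalize sentence starts in a string that has
# already been read in full.
sub capitalize_sentences {
    my ($text) = @_;
    # First word character of the whole text.
    $text =~ s/\A(\W*)(\w)/$1\u$2/;
    # First word character after a terminator that is followed by whitespace.
    # Periods inside www.goof.com or 3.14 have no whitespace after them,
    # so they are left alone.
    $text =~ s/([.?!]\s+)(\W*)(\w)/$1$2\u$3/g;
    return $text;
}

print capitalize_sentences(
    "this is a test. thisone is difficult: www.goof.com. and pi is 3.14. done."
), "\n";
```

The trade-off is that the whole text has to be in memory, where the `$/ = "."` version streams one record at a time.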