Help composing Regex for matching only Titlecase words

seand has asked for the wisdom of the Perl Monks concerning the following question:

Re: Help composing Regex for matching only Titlecase words by CountZero (Bishop) on Mar 03, 2011 at 21:49 UTC
This works: `use Modern::Perl; my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used"; my @results = $data =~ /\b[[:upper:]][[:lower:]]*?\b/g; say "@results";` Output: `Antler South Street Avebury Wiltshire England Comment Collagen`

If you feed it the string "Paroží zakotven v kopci na South Street, Avebury, Wiltshire, Anglie. Komentář (laboratoř): Kolagenní frakce používané" it will return "Paroží South Street Avebury Wiltshire Anglie Komentář Kolagenní".

Update: And if you want to match "words" like "USA", "UK" or "Test5", "Body2Body" and such, you can use `/\b[[:upper:]][[:upper:][:lower:][:digit:]]*?\b/g` as the regex.

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Re^2: Help composing Regex for matching only Titlecase words by JavaFan (Canon) on Mar 05, 2011 at 23:34 UTC
Both you and fidesachates use stingy matching (that is, using ? as a secondary quantifier). I wonder why, as it's kind of pointless.
Re^3: Help composing Regex for matching only Titlecase words by CountZero (Bishop) on Mar 06, 2011 at 17:04 UTC
You are right. It is not necessary. A space is neither uppercase nor lowercase and therefore will limit the run of the regex.
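To make the point about stingy matching concrete, here is a minimal sketch that runs both quantifier forms against the sample sentence. It assumes the posted pattern used `*?` after `[[:lower:]]`; the variable names `@stingy` and `@greedy` are only for illustration.

```perl
use strict;
use warnings;

my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";

# Stingy (non-greedy) quantifier, assumed to be what the posted answer used.
my @stingy = $data =~ /\b[[:upper:]][[:lower:]]*?\b/g;

# Plain greedy quantifier.
my @greedy = $data =~ /\b[[:upper:]][[:lower:]]*\b/g;

print "stingy: @stingy\n";
print "greedy: @greedy\n";
print "identical\n" if "@stingy" eq "@greedy";
```

Both lists should come out the same: lowercase letters are word characters, so the trailing `\b` cannot succeed in the middle of a lowercase run, and either quantifier has to stop at the end of the word.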
Re: Help composing Regex for matching only Titlecase words by kennethk (Abbot) on Mar 03, 2011 at 21:39 UTC
There are two basic approaches you can take here:

1. Capture all words that fit your requirement and return that list.
2. Remove all words that do not fit your criterion, and return the result.

In general, it is much easier to write positive regexes than negative ones, so I would use the first approach. I would do something like: `#!/usr/bin/perl use strict; use warnings; my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used"; my @result; while ($data =~ /\b([A-Z][a-z]*)/g) { push @result, $1; } print join(' ', @result), "\n";`

YAPE::Regex::Explain explains this as follows. The regular expression `(?-imsx:\b([A-Z][a-z]*))` matches as:

| NODE | EXPLANATION |
|---|---|
| `(?-imsx:` | group, but do not capture (case-sensitive) (with ^ and $ matching normally) (with . not matching \n) (matching whitespace and # normally): |
| `\b` | the boundary between a word char (\w) and something that is not a word char |
| `(` | group and capture to \1: |
| `[A-Z]` | any character of: 'A' to 'Z' |
| `[a-z]*` | any character of: 'a' to 'z' (0 or more times (matching the most amount possible)) |
| `)` | end of \1 |
| `)` | end of grouping |

See perlretut for more details.
Re^2: Help composing Regex for matching only Titlecase words by JavaFan (Canon) on Mar 03, 2011 at 21:44 UTC
You anchor your words on just one side. Which means that if you have a sentence like "The USA is a UN member", you return "The U U". I did not get the impression that is what the OP wants.
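The effect of anchoring on one side only is easy to reproduce. The sketch below assumes the pattern under discussion is `\b([A-Z][a-z]*)`, as shown in the YAPE::Regex::Explain output, and compares it with a variant that adds a trailing `\b`; the array names are made up for the example.

```perl
use strict;
use warnings;

my $text = "The USA is a UN member";

# Anchored on the left only: an all-caps word still yields its first letter.
my @left_only = $text =~ /\b([A-Z][a-z]*)/g;

# Anchored on both sides: the match must also end at a word boundary.
my @both_sides = $text =~ /\b([A-Z][a-z]*)\b/g;

print "left only:  @left_only\n";    # The U U
print "both sides: @both_sides\n";   # The
```

With the trailing `\b`, "USA" and "UN" are rejected outright, because there is no word boundary between "U" and the capital letter that follows it.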
Re^3: Help composing Regex for matching only Titlecase words by kennethk (Abbot) on Mar 03, 2011 at 21:54 UTC
The OP does not include USA or UN as example text. Mine also gives results that are likely undesirable for character sequences that contain numbers or punctuation. Yours is likely no better on that front, though you did provide a disclaimer. Development of any regular expression depends strongly on what you are going to feed it - I think mine has the advantage of outputting more junk than yours would, making its weaknesses more obvious once the OP started putting it into practice.
Re^4: Help composing Regex for matching only Titlecase words by JavaFan (Canon) on Mar 04, 2011 at 08:30 UTC
Re: Help composing Regex for matching only Titlecase words by JavaFan (Canon) on Mar 03, 2011 at 21:40 UTC
Untested: `$data = join " ", $data =~ /\b[A-Z][a-z]*\b/g;` This assumes you don't have accented letters (nor non-Western letters), and it may not do what you want for words like "Don't" and "O'Hara".
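Both caveats can be seen in a short sketch. The test sentence is invented for illustration, and the `\p{Lu}\p{Ll}*` variant is not from the thread; it is one possible way to cover accented letters using Unicode properties, assuming the input has been decoded to characters (here via `use utf8` on a literal).

```perl
use strict;
use warnings;
use utf8;                              # this source file contains literal non-ASCII letters
binmode STDOUT, ':encoding(UTF-8)';    # print the decoded characters as UTF-8

# Invented sample covering apostrophes and an accented word.
my $text = "Don't tell O'Hara that Komentář is Czech";

# ASCII-only character classes, as in the untested one-liner.
my @ascii = $text =~ /\b[A-Z][a-z]*\b/g;

# Unicode properties: any uppercase letter followed by lowercase letters.
my @unicode = $text =~ /\b\p{Lu}\p{Ll}*\b/g;

print "ascii:   @ascii\n";     # Don O Hara Czech
print "unicode: @unicode\n";   # Don O Hara Komentář Czech
```

The apostrophe is not a word character, so "Don't" and "O'Hara" are split at it in both versions. The ASCII classes drop "Komentář" entirely, while the property-based classes keep it.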
Re: Help composing Regex for matching only Titlecase words by fidesachates (Monk) on Mar 03, 2011 at 21:48 UTC
You could try this: `my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used"; while($data =~ /([A-Z]\S+?)\s/g) { print $1." "; }` Just remember that a lot of this is contingent on your data, but that should work for any normal English paragraph with regular spacing rules. UPDATE: I see a few monks pointing out words with two capital letters as a test case. This code will handle that as well since \S matches any non-space character. It'll match "G!#^&*()-+_+234" if you gave it that. I actually tested that it works.
Re^2: Help composing Regex for matching only Titlecase words by kennethk (Abbot) on Mar 03, 2011 at 22:38 UTC
Unfortunately, yours does not meet spec. The OP specifies a desired output of `Antler South Street Avebury Wiltshire England Comment Collagen`. Your regex outputs `Antler South Street, Avebury, Wiltshire, England. Comment Collagen`. It would also miss any trailing words, as in "My name is Mike.". As much as it seems like a common sense term, a 'word' is notoriously elusive from a CS perspective.
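Both objections can be checked with a small sketch that applies the `/([A-Z]\S+?)\s/g` pattern in list context; the array names are made up for the example.

```perl
use strict;
use warnings;

my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";

# \S also matches punctuation, so trailing commas and periods are captured.
my @with_punct = $data =~ /([A-Z]\S+?)\s/g;
print "@with_punct\n";
# Antler South Street, Avebury, Wiltshire, England. Comment Collagen

# The pattern requires whitespace after the word, so a word at the very
# end of the string is never matched.
my $sentence = "My name is Mike.";
my @trailing = $sentence =~ /([A-Z]\S+?)\s/g;
print "@trailing\n";
# My
```

Replacing the required `\s` with a word boundary, and `\S` with a letter class, is what the other answers in the thread do to avoid both problems.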
Re: Help composing Regex for matching only Titlecase words by Anonymous Monk on Mar 03, 2011 at 22:44 UTC
Fantastic help! Thanks a million, everyone... that did the trick, and I think I've learned a bit more regex, too. This is a quick and dirty script. I actually wanted to exclude the non-alpha chars, and I don't seem to have Modern::Perl, so kennethk's version worked for me. Cheers, Sean