Re: Help composing Regex for matching only Titlecase words
by CountZero (Bishop) on Mar 03, 2011 at 21:49 UTC
use Modern::Perl;
my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";
my @results = $data =~ /\b[[:upper:]][[:lower:]]*?\b/g;
say "@results";
Output: Antler South Street Avebury Wiltshire England Comment Collagen
If you feed it the string "Paroží zakotven v kopci na South Street, Avebury, Wiltshire, Anglie. Komentář (laboratoř): Kolagenní frakce používané" it will return "Paroží South Street Avebury Wiltshire Anglie Komentář Kolagenní".
Update: And if you want to match "words" like "USA", "UK" or "Test5", "Body2Body" and such, you can use /\b[[:upper:]][[:upper:][:lower:][:digit:]]*?\b/g as the regex.
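A minimal sketch of the two regexes above without Modern::Perl. It assumes the script file is saved as UTF-8; the `use utf8` pragma and the `binmode` call are additions not in the original post, needed so the Czech literal is treated as characters rather than bytes. The variable names and the second test sentence are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                              # assumption: this source file is saved as UTF-8
binmode(STDOUT, ':encoding(UTF-8)');   # avoids "Wide character" warnings when printing

my $czech = "Paroží zakotven v kopci na South Street, Avebury, Wiltshire, Anglie. Komentář (laboratoř): Kolagenní frakce používané";

# POSIX classes follow Unicode rules on decoded character strings,
# so accented letters count as upper/lower case.
my @titlecase = $czech =~ /\b[[:upper:]][[:lower:]]*?\b/g;
print "@titlecase\n";
# prints: Paroží South Street Avebury Wiltshire Anglie Komentář Kolagenní

# The regex from the update also accepts upper-case letters and digits after the first letter.
my $mixed_text = "The USA and UK ran Test5 and Body2Body";   # hypothetical sample
my @mixed = $mixed_text =~ /\b[[:upper:]][[:upper:][:lower:][:digit:]]*?\b/g;
print "@mixed\n";
# prints: The USA UK Test5 Body2Body
```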
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
Both you and fidesachates use stingy matching (that is, using ? as a secondary quantifier). I wonder why, as it's kind of pointless.
You are right. It is not necessary. A space is neither upper- nor lowercase, so it ends the run of the regex either way.
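A quick check of that point, as a sketch: with a trailing \b, the stingy and the greedy quantifier can only stop at the same place, so both return the same list. The shortened sample string and variable names are only for illustration.

```perl
use strict;
use warnings;

my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England.";

# Stingy (*?): tries the shortest run first, but the trailing \b forces it
# to keep extending until the end of the word.
my @stingy = $data =~ /\b[[:upper:]][[:lower:]]*?\b/g;

# Greedy (*): grabs the whole run of lower-case letters in one go.
my @greedy = $data =~ /\b[[:upper:]][[:lower:]]*\b/g;

print "stingy: @stingy\n";   # stingy: Antler South Street Avebury Wiltshire England
print "greedy: @greedy\n";   # greedy: Antler South Street Avebury Wiltshire England
```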
Re: Help composing Regex for matching only Titlecase words
by kennethk (Abbot) on Mar 03, 2011 at 21:39 UTC
There are two basic approaches you can take here:
- Capture all words that fit your requirement and return that list
- Remove all words that do not fit your criterion, and return the result
In general, it is much easier to write positive regexes than negative ones, so I would use the first approach. I would do something like:
#!/usr/bin/perl
use strict;
use warnings;
my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";
my @result;
while ($data =~ /\b([A-Z][a-z]*)/g) {
    push @result, $1;
}
print join(' ', @result), "\n";
YAPE::Regex::Explain explains this as:
The regular expression:
(?-imsx:\b([A-Z][a-z]*))
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    [A-Z]                    any character of: 'A' to 'Z'
----------------------------------------------------------------------
    [a-z]*                   any character of: 'a' to 'z' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
See perlretut for more details.
You anchor your words on just one side, which means that if you have a sentence like "The USA is a UN member", you return "The U U". I did not get the impression that is what the OP wants.
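To make the difference concrete, here is a small sketch comparing the regex anchored on one side with the same regex anchored on both sides. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = "The USA is a UN member";

# \b only in front: the match may stop in the middle of a word.
my @one_sided = $text =~ /\b([A-Z][a-z]*)/g;

# \b on both sides: the match must cover the whole word,
# so all-caps words are skipped entirely.
my @two_sided = $text =~ /\b([A-Z][a-z]*)\b/g;

print "one-sided: @one_sided\n";   # one-sided: The U U
print "two-sided: @two_sided\n";   # two-sided: The
```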
The OP does not include USA or UN as example text. Mine also gives results that are likely undesirable for character sequences that contain numbers or punctuation. Yours is likely no better on that front, though you did provide a disclaimer. Development of any regular expression depends strongly on what you are going to feed it - I think mine has the advantage of outputting more junk than yours would, making its weaknesses more obvious once the OP started putting it into practice.
Re: Help composing Regex for matching only Titlecase words
by JavaFan (Canon) on Mar 03, 2011 at 21:40 UTC
$data = join " ", $data =~ /\b[A-Z][a-z]*\b/g;
This assumes you don't have accented letters (nor non-Western letters), and it may not do what you want for words like "Don't" and "O'Hara".
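A sketch of the apostrophe caveat. The sample sentence is invented, and the second regex is only one possible way to allow internal apostrophes, not something proposed in this thread; whether "O'Hara" should count as one Titlecase word depends on your data.

```perl
use strict;
use warnings;

my $text = "Don't ask O'Hara about Avebury";   # hypothetical sample

# The regex as posted: an apostrophe is a non-word character,
# so \b splits the word around it.
my @plain = $text =~ /\b[A-Z][a-z]*\b/g;
print "@plain\n";        # prints: Don O Hara Avebury

# Assumed variant: allow apostrophe-joined parts after the first one.
my @with_apostrophe = $text =~ /\b[A-Z][a-z]*(?:'[A-Za-z][a-z]*)*\b/g;
print "@with_apostrophe\n";   # prints: Don't O'Hara Avebury
```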
Re: Help composing Regex for matching only Titlecase words
by fidesachates (Monk) on Mar 03, 2011 at 21:48 UTC
my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";
while($data =~ /([A-Z]\S+?)\s/g)
{
print $1." ";
}
Just remember that a lot of this is contingent on your data, but that should work for any normal English paragraph with regular spacing rules.
UPDATE: I see a few monks pointing out words with two capital letters as a test case. This code will handle that as well, since \S matches any non-space character. It'll match "G!#^&*()-+_+234" if you give it that. I actually tested that it works.
Unfortunately, yours does not meet spec. The OP specifies a desired output of Antler South Street Avebury Wiltshire England Comment Collagen. Your regex outputs Antler South Street, Avebury, Wiltshire, England. Comment Collagen. It would also miss any trailing words, as in "My name is Mike.". As much as it seems like a common sense term, a 'word' is notoriously elusive from a CS perspective.
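Both problems are easy to reproduce. The sketch below runs the \S-based regex on the OP's string and on the trailing-word example; variable names are only for illustration.

```perl
use strict;
use warnings;

my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";

# \S also matches punctuation, so commas and periods end up in the capture.
my @loose = $data =~ /([A-Z]\S+?)\s/g;
print "@loose\n";
# prints: Antler South Street, Avebury, Wiltshire, England. Comment Collagen

# The regex requires whitespace after the word, so a word at the very end
# of the string is never matched.
my @trailing = "My name is Mike." =~ /([A-Z]\S+?)\s/g;
print "@trailing\n";
# prints: My
```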
Re: Help composing Regex for matching only Titlecase words
by Anonymous Monk on Mar 03, 2011 at 22:44 UTC
Fantastic help! Thanks a million, everyone... that did the trick, and I think I've learned a bit more regex, too. This is a quick and dirty script; I actually wanted to exclude the non-alpha chars, and I don't seem to have Modern::Perl, so kennethk's version worked for me.
Cheers,
Sean