seand has asked for the wisdom of the Perl Monks concerning the following question:

Hi there, and thanks in advance for your help! I am trying to parse a long and messy string and simply return a list of only words that begin with a capital letter. This should be simple, but my regex skills appear to be lacking. So, for example:

    my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";
    # should return: "Antler South Street Avebury Wiltshire England Comment Collagen"

This doesn't work:

    $data =~ s/[A-Z]{2,100}//;
    $data =~ s/[a-z]{2,100}//;
    $data =~ s/\s+//;
    print STDOUT $data, "\n";

Thanks a million, monks! -Sean

Re: Help composing Regex for matching only Titlecase words
by CountZero (Bishop) on Mar 03, 2011 at 21:49 UTC
    This works:
    use Modern::Perl;
    my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";
    my @results = $data =~ /\b[[:upper:]][[:lower:]]*?\b/g;
    say "@results";
    Output: Antler South Street Avebury Wiltshire England Comment Collagen

    If you feed it the string "Paroží zakotven v kopci na South Street, Avebury, Wiltshire, Anglie. Komentář (laboratoř): Kolagenní frakce používané" it will return "Paroží South Street Avebury Wiltshire Anglie Komentář Kolagenní".
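One caveat, stated here as an assumption about how the Czech example was run: the POSIX classes only treat letters such as ž or ř as letters when the string holds decoded characters (for instance literals under `use utf8`, or input that has been decoded). A minimal sketch follows; the helper name `titlecase_words` is hypothetical, and the accented letters are written as `\x{...}` escapes so the result does not depend on the file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: return every word made of one uppercase letter
# followed by zero or more lowercase letters.
sub titlecase_words {
    my ($text) = @_;
    return $text =~ /\b[[:upper:]][[:lower:]]*\b/g;
}

# Part of the Czech sentence from the post, written with escapes so that
# Perl sees a character string whatever the source file's encoding is.
my $czech = "Paro\x{17E}\x{ED} zakotven v kopci na South Street, "
          . "Koment\x{E1}\x{159} (laborato\x{159}): Kolagenn\x{ED} frakce";

my @czech_words = titlecase_words($czech);

binmode STDOUT, ':encoding(UTF-8)';   # avoid "Wide character" warnings
print "@czech_words\n";
```

If the same text arrives as raw undecoded bytes, the accented letters may not count as word characters and the matches can be cut short, so decoding the input first is the safer route.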

    Update: And if you want to match "words" like "USA", "UK" or "Test5", "Body2Body" and such you can use /\b[[:upper:]][[:upper:][:lower:][:digit:]]*?\b/g as the regex.
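As a quick check of the broader pattern from the update, here is a small sketch. The helper name `capitalised_tokens` and the sample sentence are made up for illustration, and the quantifier is written greedy, which should not change the matches because of the trailing `\b`.

```perl
use strict;
use warnings;

# Hypothetical helper: tokens that start with an uppercase letter and
# continue with any mix of uppercase letters, lowercase letters and digits.
sub capitalised_tokens {
    my ($text) = @_;
    return $text =~ /\b[[:upper:]][[:upper:][:lower:][:digit:]]*\b/g;
}

my @tokens = capitalised_tokens(
    "The USA and UK ran Test5 on Body2Body, not test6."
);
print "@tokens\n";   # The USA UK Test5 Body2Body
```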

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Both you and fidesachates use stingy matching (that is, using ? as a secondary quantifier). I wonder why, as it's kind of pointless.
        You are right. It is not necessary. A space is neither upper, nor lowercase and therefore will limit the run of the regex.

        CountZero
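To illustrate the point about the stingy quantifier: because the pattern ends in `\b`, the lazy form has to keep extending until it reaches a word boundary anyway, so both forms should return the same list. A small sketch using the OP's sample string:

```perl
use strict;
use warnings;

my $sample = "Antler embedded in mound at South Street, Avebury, Wiltshire, "
           . "England. Comment (lab): Collagen fraction used";

# Lazy (stingy) quantifier, as originally posted.
my @lazy   = $sample =~ /\b[[:upper:]][[:lower:]]*?\b/g;

# Greedy quantifier: same matches, since \b pins the end of each word.
my @greedy = $sample =~ /\b[[:upper:]][[:lower:]]*\b/g;

print "lazy:   @lazy\n";
print "greedy: @greedy\n";
```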


Re: Help composing Regex for matching only Titlecase words
by kennethk (Abbot) on Mar 03, 2011 at 21:39 UTC
    There are two basic approaches you can take here
    1. Capture all words that fit your requirement and return that list
    2. Remove all words that do not fit your criterion, and return the result
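For contrast, the removal approach (option 2) could be sketched roughly as below. This is only an illustration under assumptions: the helper name `remove_non_titlecase` is hypothetical, and it treats everything that is not an ASCII letter as a separator.

```perl
use strict;
use warnings;

# Hypothetical helper for approach 2: delete what does not fit,
# then tidy up what is left.
sub remove_non_titlecase {
    my ($text) = @_;

    # Delete every word that is NOT one capital followed only by
    # lowercase letters (the negative lookahead rejects the keepers).
    $text =~ s/\b(?![A-Z][a-z]*\b)\w+\b//g;

    # Turn punctuation and runs of whitespace into single spaces.
    $text =~ s/[^A-Za-z]+/ /g;

    # Trim leading and trailing spaces.
    $text =~ s/^\s+|\s+$//g;

    return $text;
}

my $input = "Antler embedded in mound at South Street, Avebury, Wiltshire, "
          . "England. Comment (lab): Collagen fraction used";

print remove_non_titlecase($input), "\n";
```

It takes three substitutions and a lookahead to get what a single capturing match returns directly, which is the practical argument for the capture approach.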

    In general, it is much easier to write positive regexes than negative ones, so I would use the first approach. I would do something like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";
    my @result;
    while ($data =~ /\b([A-Z][a-z]*)/g) {
        push @result, $1;
    }
    print join(' ', @result), "\n";

    YAPE::Regex::Explain explains this as

    The regular expression:

    (?-imsx:\b([A-Z][a-z]*))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        [A-Z]                    any character of: 'A' to 'Z'
    ----------------------------------------------------------------------
        [a-z]*                   any character of: 'a' to 'z' (0 or more
                                 times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

    See perlretut for more details.

      You anchor your words on just one side. Which means that if you have a sentence like "The USA is a UN member", you return "The U U". I did not get the impression that is what the OP wants.
        The OP does not include USA or UN as example text. Mine also gives results that are likely undesirable for character sequences that contain numbers or punctuation. Yours is likely no better on that front, though you did provide a disclaimer. Development of any regular expression depends strongly on what you are going to feed it - I think mine has the advantage of outputting more junk than yours would, making its weaknesses more obvious once the OP started putting it into practice.
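The difference discussed in this exchange is easy to reproduce. A minimal sketch comparing the pattern anchored on one side with the same pattern plus a trailing `\b`, using the sentence from the reply above:

```perl
use strict;
use warnings;

my $sentence = "The USA is a UN member";

# Anchored only at the start of the word: the match may stop mid-word.
my @one_sided = $sentence =~ /\b([A-Z][a-z]*)/g;

# Anchored on both sides: the whole word must fit the pattern.
my @two_sided = $sentence =~ /\b([A-Z][a-z]*)\b/g;

print "one-sided: @one_sided\n";   # The U U
print "two-sided: @two_sided\n";   # The
```

Neither result is right or wrong in itself; which one is wanted depends on whether acronyms should be dropped or kept in full.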
Re: Help composing Regex for matching only Titlecase words
by JavaFan (Canon) on Mar 03, 2011 at 21:40 UTC
    Untested:
    $data = join " ", $data =~ /\b[A-Z][a-z]*\b/g;
    This assumes you don't have accented letters (nor non-Western letters), and it may not do what you want for words like "Don't" and "O'Hara".
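To make the apostrophe caveat concrete, here is a small sketch. The second pattern is only one possible extension (an assumption, not part of the original suggestion): it allows an apostrophe followed by more letters inside a word, so it would also accept possessives such as "Street's".

```perl
use strict;
use warnings;

my $names = "Don't tell O'Hara about Avebury";

# The pattern as posted: the apostrophe acts as a word boundary,
# so such words are split into pieces.
my @plain = $names =~ /\b[A-Z][a-z]*\b/g;

# Hypothetical variant: permit apostrophe-joined parts inside a word.
my @with_apostrophes = $names =~ /\b[A-Z][a-z]*(?:'[A-Za-z][a-z]*)*\b/g;

print "plain:   @plain\n";              # Don O Hara Avebury
print "variant: @with_apostrophes\n";   # Don't O'Hara Avebury
```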
Re: Help composing Regex for matching only Titlecase words
by fidesachates (Monk) on Mar 03, 2011 at 21:48 UTC
    You could try this
    my $data = "Antler embedded in mound at South Street, Avebury, Wiltshire, England. Comment (lab): Collagen fraction used";
    while ($data =~ /([A-Z]\S+?)\s/g) {
        print $1 . " ";
    }
    Just remember that a lot of this is contingent on your data, but that should work for any normal English paragraph with regular spacing rules.

    UPDATE: I see a few monks pointing out words with two capital letters as a test case. This code will handle that as well since \S matches any non space character. It'll match "G!#^&*()-+_+234" if you gave it that. I actually tested that it works.
      Unfortunately, yours does not meet spec. The OP specifies a desired output of Antler South Street Avebury Wiltshire England Comment Collagen. Your regex outputs Antler South Street, Avebury, Wiltshire, England. Comment Collagen. It would also miss any trailing words, as in "My name is Mike.". As much as it seems like a common sense term, a 'word' is notoriously elusive from a CS perspective.
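The trailing-word problem can be seen with the short sentence from the reply above. In the sketch below, the second pattern swaps the required `\s` for a lookahead that also accepts the end of the string; that variant is an assumption added for illustration, and it still keeps the attached punctuation.

```perl
use strict;
use warnings;

my $line = "My name is Mike.";

# As posted: every match must be followed by a whitespace character,
# so the last word of the string is never captured.
my @needs_space = $line =~ /([A-Z]\S+?)\s/g;

# Hypothetical variant: accept whitespace OR end of string after the word.
my @space_or_end = $line =~ /([A-Z]\S+?)(?=\s|\z)/g;

print "as posted: @needs_space\n";    # My
print "variant:   @space_or_end\n";   # My Mike.
```

Note that `\S+?` needs at least one character after the capital, so a one-letter word such as "A" would be skipped by both patterns.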
Re: Help composing Regex for matching only Titlecase words
by Anonymous Monk on Mar 03, 2011 at 22:44 UTC

    Fantastic help! Thanks a million, everyone. That did the trick, and I think I've learned a bit more regex, too. This is a quick and dirty script, and I actually wanted to exclude the non-alpha chars. I don't seem to have Modern::Perl, so kennethk's version worked for me. Cheers, Sean