MiamiGenome has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

Could one of the Perl elders kindly provide some insight into this regex question?

I have an array of user-input terms which may be "concatenated" or not (see @terms). I would like the array to look like @separated_terms. Basically, just cleave each element wherever an uppercase letter follows a lowercase one.
# Input
@terms = qw/Genetics Genomics phylogeny allele ChromosomeLocusLink geneExpression RasSignalTransductionPathway/;

foreach my $words (@terms) {
    my @wordlist = $words =~ /(?:(.+?[a-z])([A-Z].+))+/g;
}

# Desired Result
@separated_terms = qw/Genetics Genomics phylogeny allele Chromosome Locus Link gene Expression Ras Signal Transduction Pathway/;
Here are my attempts, using @wordlist to hold the substrings (several variations omitted):
# does not work - removes last lowercase and first uppercase letter at each boundary
# my @wordlist = split /[a-z][A-Z]/, $words;

# does not work - only separates the last term from the list
my @wordlist = $words =~ /(?:(.+?[a-z])([A-Z].+))+/g;

Replies are listed 'Best First'.
Re: Regular Expression - split string by lower/upper case
by davidrw (Prior) on Apr 14, 2006 at 18:25 UTC
    It's because of the greedy .+ you have after [A-Z]. Tweaking that to [^A-Z]+ instead of .+ will work, and gives what you want with that test data. An alternative is to do a substitution that inserts a space at each lowercase/uppercase boundary, then split on whitespace.
    use strict;
    use warnings;

    while (<DATA>) {
        # split regex
        print join ":", grep length, split /(?:(.+?[a-z])([A-Z][^A-Z]+))/g, $_;

        # substitution then split method
        s/(?<=[a-z])(?=[A-Z])/ /g;
        print join(":", split ' ', $_), "\n";
    }

    __DATA__
    Genetics
    Genomics
    phylogeny
    allele
    ChromosomeLocusLink
    geneExpression
    RasSignalTransductionPathway
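For what it's worth, the same look-behind/look-ahead pair can also be handed straight to split, which skips the intermediate space. This is a minimal sketch rather than anything posted in the thread: split_camel is a made-up helper name, and it assumes the terms contain only ASCII letters.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one term at every
# zero-width position with a lowercase letter on the left and an
# uppercase letter on the right. No characters are consumed.
sub split_camel {
    my ($term) = @_;
    return split /(?<=[a-z])(?=[A-Z])/, $term;
}

# Same test data as in the original question.
my @terms = qw/Genetics Genomics phylogeny allele ChromosomeLocusLink
               geneExpression RasSignalTransductionPathway/;

# Flatten the per-term pieces into one list.
my @separated_terms = map { split_camel($_) } @terms;

print join(' ', @separated_terms), "\n";
```

Because the pattern is zero-width, split has nothing to throw away, so the boundary letters survive, unlike with split /[a-z][A-Z]/.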
      I was going to do a substitution along the lines of s/([a-z])([A-Z])/$1 $2/g; but your look-behind/look-ahead is much neater and probably quicker. Something else new I have learned today.

      Thank you,

      JohnGG
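To put the two substitutions side by side, here is a small sketch that runs both on the concatenated terms from the question and checks that they agree. The sub names with_captures and with_lookaround are invented for illustration, and the sketch makes no claim about which one is faster.

```perl
use strict;
use warnings;

# Hypothetical helper: capture-based version. It consumes both boundary
# letters and writes them back with a space in between.
sub with_captures {
    my ($term) = @_;
    $term =~ s/([a-z])([A-Z])/$1 $2/g;
    return $term;
}

# Hypothetical helper: lookaround version. It matches the empty position
# between the two letters and inserts a space there.
sub with_lookaround {
    my ($term) = @_;
    $term =~ s/(?<=[a-z])(?=[A-Z])/ /g;
    return $term;
}

for my $term (qw/ChromosomeLocusLink geneExpression RasSignalTransductionPathway/) {
    my $same = with_captures($term) eq with_lookaround($term) ? 'same' : 'DIFFERENT';
    print "$term -> ", with_lookaround($term), " ($same)\n";
}
```

The two should always agree here, since a letter cannot be both the uppercase end of one boundary and the lowercase start of the next, so the capture-based matches never need to overlap.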

      THANK YOU!! I used the substitution with positive lookahead and positive lookbehind. Worked like a charm -- not surprising, this is the Perl Monks! Cheers!