isync has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Is there a simple way to split CamelCase back into words? I am looking for a way to tokenize filenames without running a dictionary against them.

How would I split:
ThisIsACamelCasedString

Any ideas? I see difficulties especially on splitting the "ACamelcase" part, where two capital letters join. Same as in "AStringWithFTP", so I don't end up with "f", "t", "p"...

Re: How to split CamelCase?
by salva (Canon) on Sep 18, 2007 at 09:57 UTC
    for (qw(CamelCase AStringWithFTP AStringWithFTPOrSFTP)) {
        my @parts = /[A-Z](?:[A-Z]+|[a-z]*)(?=$|[A-Z])/g;
        print "$_ => @parts\n";
    }
    # outputs:
    # CamelCase => Camel Case
    # AStringWithFTP => A String With FTP
    # AStringWithFTPOrSFTP => A String With FTP Or SFTP
Re: How to split CamelCase?
by johngg (Canon) on Sep 18, 2007 at 10:08 UTC
    The best I can come up with is to use look-arounds for the splitting and then splice to concatenate single capitals. It is not perfect, as you can see from the third data item.

    use strict;
    use warnings;

    my @strings = (
        q{ThisIsACamelCasedString},
        q{AStringWithFTP},
        q{HeIsANASAAstronaut},
    );

    my $rxCamel = qr{(?x)
        (?<=[a-z])(?=[A-Z])
        |
        (?<=[A-Z])(?=[A-Z])
    };

    foreach my $string ( @strings ) {
        print qq{String: $string\n};
        my @words = split m{$rxCamel}, $string;
        for my $idx ( reverse 1 .. $#words ) {
            if ( $words[$idx] =~ m{^[A-Z]+$} and $words[$idx - 1] =~ m{^[A-Z]$} ) {
                $words[$idx - 1] .= splice @words, $idx, 1;
            }
        }
        print qq{ $_\n} for @words;
    }

    Here's the output.

    String: ThisIsACamelCasedString
     This
     Is
     A
     Camel
     Cased
     String
    String: AStringWithFTP
     A
     String
     With
     FTP
    String: HeIsANASAAstronaut
     He
     Is
     ANASA
     Astronaut

    I think there will be too many corner cases for this task to be achieved without some comparison with, perhaps, a dictionary list of acronyms.
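
    As a rough sketch of what such a comparison might look like: the snippet below takes salva's regex from earlier in the thread as a first pass, then checks each unknown upper-case run against an acronym list. The `%is_acronym` list and the `split_camel_with_acronyms` name are made up for illustration, and the rule (a leading A, I or O glued onto a known acronym) is only an assumption about what a useful heuristic could be, not a general solution.

```perl
use strict;
use warnings;

# Hypothetical acronym list; a real one would have to come from your own data.
my %is_acronym = map { $_ => 1 } qw(FTP SFTP HTTP NASA);

sub split_camel_with_acronyms {
    my ($string) = @_;

    # First pass: salva's regex from earlier in the thread.
    my @parts = $string =~ /[A-Z](?:[A-Z]+|[a-z]*)(?=$|[A-Z])/g;

    my @words;
    for my $part (@parts) {
        # An unknown upper-case run such as "ANASA" may be a one-letter
        # word (A, I or O) glued to a known acronym.
        if (   $part =~ /^([AIO])([A-Z]+)$/
            && !$is_acronym{$part}
            && $is_acronym{$2} )
        {
            push @words, $1, $2;
        }
        else {
            push @words, $part;
        }
    }
    return @words;
}

print join( ' ', split_camel_with_acronyms('HeIsANASAAstronaut') ), "\n";
print join( ' ', split_camel_with_acronyms('AStringWithFTP') ),     "\n";
```

    With that list, the third data item comes out as "He Is A NASA Astronaut", but anything missing from the list is still left as one run.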

    I hope this is of use.

    Cheers,

    JohnGG

Re: How to split CamelCase?
by Sidhekin (Priest) on Sep 18, 2007 at 10:02 UTC

    First, since there is no separator, I would not first look to split, but rather to m//g in list context:

    my $string = "ThisIsACamelCasedString";
    my @split = $string =~ /([A-Z][a-z]*)/g;
    print "Parts: @split";

    (This version just skips any part that does not match /[A-Z][a-z]*/; season to taste.)

    Dealing with "AStringWithFTP" is trickier, at least without a dictionary. The only heuristic I can suggest is to treat a sequence of upper-case characters as a word of its own if it is in final position or if it precedes an uppercase-lowercase sequence:

    my $string = "AStringWithFTPAndHTTP";
    my @split = $string =~ /([A-Z](?:[A-Z]*(?=$|[A-Z][a-z])|[a-z]*))/g;
    print "Parts: @split";

    That this is a heuristic is clear from noting that "ACString" is split as "AC", "String", and not as "A", "C", "String". But frankly, without a dictionary, I don't think there is a real solution.
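
    To make that limitation easy to reproduce, here is a small sketch that wraps the same regex in a helper and runs it over both strings. The `split_camel` name is just a label chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the heuristic regex from this reply.
sub split_camel {
    my ($string) = @_;
    return $string =~ /([A-Z](?:[A-Z]*(?=$|[A-Z][a-z])|[a-z]*))/g;
}

for my $string (qw(AStringWithFTPAndHTTP ACString)) {
    my @parts = split_camel($string);
    print "$string => @parts\n";
}
```

    The first string splits as "A String With FTP And HTTP", while the second gives "AC String", because the regex alone cannot tell "AC" from "A" followed by "C".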

    Update: Oops, pasted the wrong code. Fixed now. Also added prints to make it easier to test that it really is the code I meant to post ...

    Update2: Err, pasted the wrong right code too. Sorry, salva, I did not mean to steal it. /me *blushes* (Fixed now.)

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!

Re: How to split CamelCase?
by CountZero (Bishop) on Sep 18, 2007 at 19:53 UTC
    I don't think I can improve on all the regexes suggested by my fellow Monks, but I find it somewhat strange that for lowercase they all use [a-z] (and for uppercase [A-Z]), thereby entirely forgetting that there are a lot of other lower/uppercase characters around which fall outside of these character classes.

    Far easier and better to use [[:lower:]] and [[:upper:]] which will work with all lower and upper case characters.
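
    A minimal sketch of the difference, assuming Perl 5.14 or later so that the /u modifier is available to force Unicode rules: it swaps the POSIX classes into salva's pattern and compares the result with the ASCII-only version. The `split_camel_unicode` name and the sample string are invented for the example.

```perl
use strict;
use warnings;
use utf8;    # the source below contains non-ASCII literals
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: salva's pattern with POSIX classes swapped in.
# The /u modifier (assumes Perl 5.14 or later) forces Unicode rules.
sub split_camel_unicode {
    my ($string) = @_;
    return $string
        =~ /[[:upper:]](?:[[:upper:]]+|[[:lower:]]*)(?=$|[[:upper:]])/gu;
}

my $name    = 'ÜberÉcoleFTP';
my @ascii   = $name =~ /[A-Z](?:[A-Z]+|[a-z]*)(?=$|[A-Z])/g;
my @unicode = split_camel_unicode($name);

print "ASCII classes: @ascii\n";
print "POSIX classes: @unicode\n";
```

    Here the ASCII classes only pick up "FTP", whereas the POSIX classes recover "Über", "École" and "FTP".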

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: How to split CamelCase?
by isync (Hermit) on Sep 18, 2007 at 10:13 UTC
    Woah! That was quick! Thanks everyone for so much input!
Re: How to split CamelCase?
by girarde (Hermit) on Sep 18, 2007 at 13:42 UTC
    Simple and globally correct, no. You'd need a word list or an acronym list to be able to distinguish acronyms that start with A, I, or O from word pairs that start with A, I and O.

      Even with a dictionary, you cannot easily split out:

      ACStringContainingTheStandardACVoltageForTheCurrentCountry = "110V";

      You'd have to have a context sensitive system capable of understanding English for the really tough examples.

Re: How to split CamelCase?
by warddav (Initiate) on Sep 19, 2007 at 12:21 UTC
    I like salva's solution but would extend it to allow for the processing of camelcased strings that begin with lowercase letters, e.g., "thisIsValidToo". <more in a minute>
      Sorry....meant to include

      /[a-z]+|[A-Z](?:[A-Z]+|[a-z]*)(?=$|[A-Z])/g

      as possible regex.
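
      A quick way to try it out, as a sketch: the helper below simply wraps the regex above (the `split_camel_lc` name and the test strings are made up for this example).

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the extended regex from this reply.
sub split_camel_lc {
    my ($string) = @_;
    return $string =~ /[a-z]+|[A-Z](?:[A-Z]+|[a-z]*)(?=$|[A-Z])/g;
}

for my $string (qw(thisIsValidToo parseHTTPResponse CamelCase)) {
    my @parts = split_camel_lc($string);
    print "$string => @parts\n";
}
```

      The leading lowercase run is picked up by the first alternative, so "thisIsValidToo" comes out as "this Is Valid Too".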
Re: How to split CamelCase?
by isync (Hermit) on Feb 27, 2010 at 16:51 UTC
    Combining some of the above, and challenging the algo with real world strings that aren't plain camelcase but still require some camelcase parsing:
    my @names = (
        'TheTaoOfProgramming',
        'NowA more,problematic stringExample',
        'Make The Girl Dance - Baby, Baby, Baby_(Audio replaced, horizontally flipped!)_fmt35',
    );
    for (@names) {    # array elements can be modified through $_; literals in the for list are read-only
        next if $_ =~ /\W/;    # skip names containing non-word characters
        $_ =~ s/_/ /g;
        my @split = $_ =~ /[[:lower:]0-9]+|[[:upper:]0-9](?:[[:upper:]0-9]+|[[:lower:]0-9]*)(?=$|[[:upper:]0-9])/g;
        $_ = "@split";
    }