http://qs1969.pair.com?node_id=491903


in reply to Splitting strings into words when there are no separators

Your employer (or whoever) should consider relaxing the requirement and allowing http://green-brick.foobar and http://foo.bar/green-brick. A hyphen is perfectly valid in a host name, and allowing it would save you a lot of trouble.

However, there are ways to do what you ask. Here's a very naive approach:

#!/usr/bin/perl -l

use strict;
use warnings;

# build word list in %words
my %words;
open my $dict, '<', "/usr/dict/words"
  or die "Cannot open /usr/dict/words: $!";
chomp, $words{lc $_} = 1 while <$dict>;  # UPDATE: added 'lc'
close $dict;

my $str = "perfumesmellslikecheese";

$str =~ m{
  ^                            # anchor to beginning of string
  (?{ [ ] })                   # start $^R as an empty array ref
  (?:                          # match this block <<
    (\w{2,})                   #   capture 2 or more letters to $1
    (?(?{ $words{lc $1} })     #   if lowercase '$1' is in %words...
      (?{ [ @{$^R}, $1 ] })    #     add this word to the current list
      |                        #   otherwise...
      (?!)                     #     fail (force \w{2,} to backtrack)
    )
  )+                           # >> one or more times
  $                            # anchor to end of string
  (?{ print "@{$^R}" })        # print the words (with spaces)
  (?!)                         # fail (cause everything to backtrack)
}x;

You can make the engine a great deal smarter by making it adjust dynamically, for example by only letting it match things you KNOW to be words.
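
One way to read "only match things you KNOW to be words" is to build the alternation from the word list itself, so the engine never tries a substring that is not in the list. The following is a minimal sketch of that idea, not japhy's code: the sub name known_word_splits and the short word list are made up for illustration, and it assumes Perl 5.18 or later, where literal code blocks in an interpolated pattern behave as closures.

```perl
use strict;
use warnings;

# Hypothetical helper: return every way to split $str into the given words.
sub known_word_splits {
    my ($str, @words) = @_;
    return () unless @words;

    # Build one alternation of the known words, longest first.
    my $alt = join '|', map { quotemeta }
              sort { length $b <=> length $a } @words;
    my $word_re = qr/(?:$alt)/i;

    my @found;
    $str =~ m{
        ^                               # anchor to beginning of string
        (?{ [ ] })                      # start $^R as an empty array ref
        (?:
            ($word_re)                  # only a known word can match here
            (?{ [ @{$^R}, $1 ] })       # add it to the current list
        )+
        $                               # anchor to end of string
        (?{ push @found, "@{$^R}" })    # remember this split
        (?!)                            # fail, to force the next split
    }x;
    return @found;
}

# Small stand-in word list (an assumption, not /usr/dict/words).
my @list = qw(perfume perfumes smell smells mells like cheese);
print "$_\n" for known_word_splits("perfumesmellslikecheese", @list);
```

With a full dictionary the alternation gets very large, so treat this as an illustration of the principle rather than something tuned for /usr/dict/words.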

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

Re^2: Splitting strings into words when there are no separators
by inman (Curate) on Sep 15, 2005 at 16:39 UTC
    This treatment compares each entry in the dictionary to the data and stores matching positions. All of the possible matches are then reconstructed in the printwords function.

    The words are read from the dictionary and compared without needing to be stored in memory.
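
The code this reply describes is not reproduced here, so the following is only a rough sketch of the approach as described: stream the dictionary, record where each word occurs in the string, then rebuild the complete splits from those positions. The names find_matches and splits_from, the in-memory filehandle, and the short word list are assumptions made for illustration.

```perl
use strict;
use warnings;

# Read the dictionary one word at a time and record, for each offset in
# $str, which words start there. Only the matches are kept in memory.
sub find_matches {
    my ($dict_fh, $str) = @_;
    my $lc_str = lc $str;
    my %at;                                   # offset => { word => 1 }
    while (my $word = <$dict_fh>) {
        chomp $word;
        next if length($word) < 2;            # same 2-letter minimum as \w{2,}
        $word = lc $word;
        my $pos = 0;
        while (($pos = index($lc_str, $word, $pos)) >= 0) {
            $at{$pos}{$word} = 1;
            $pos++;
        }
    }
    return \%at;
}

# Rebuild every sequence of recorded words covering the string from $pos
# to the end. Returns a list of array refs.
sub splits_from {
    my ($at, $len, $pos) = @_;
    return ([]) if $pos == $len;              # reached the end: one empty tail
    my @out;
    for my $word (sort keys %{ $at->{$pos} || {} }) {
        push @out, [ $word, @$_ ]
            for splits_from($at, $len, $pos + length $word);
    }
    return @out;
}

# In-memory stand-in for the dictionary file (an assumption).
my $dict_text = join "\n", qw(perfume perfumes smell smells mells like cheese), '';
open my $dict, '<', \$dict_text or die "Cannot open in-memory dictionary: $!";
my $str = "perfumesmellslikecheese";
my $at  = find_matches($dict, $str);
close $dict;
print "@$_\n" for splits_from($at, length $str, 0);
```

The memory cost here depends on how many dictionary words occur in the string, not on the size of the dictionary.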

Re^2: Splitting strings into words when there are no separators
by pingo (Hermit) on Jul 16, 2009 at 12:26 UTC
    Ok, so this may be four years late, but still... For the benefit of someone who stumbles across this.

    If you don't mind being English-specific, the (\w{2,}) bit could be extended slightly to (\w{2,}|[aAiI]). Makes "a" and "I" match too.
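
A small sketch of that change in context, assuming Perl 5.18 or later. The sub name split_words and the four-word %words hash are made up for illustration. Note that "a" and "i" still have to be present in %words, since the dictionary check is unchanged.

```perl
use strict;
use warnings;

# Tiny stand-in dictionary (an assumption, not /usr/dict/words).
my %words = map { $_ => 1 } qw(a i am cat);

# Hypothetical wrapper around the pattern from the parent node,
# with the capture extended to allow the one-letter words "a" and "I".
sub split_words {
    my ($str) = @_;
    my @found;
    $str =~ m{
        ^
        (?{ [ ] })                      # start $^R as an empty array ref
        (?:
            (\w{2,}|[aAiI])             # 2+ letters, or a lone "a" / "I"
            (?(?{ $words{lc $1} })      # if lowercase $1 is in %words...
                (?{ [ @{$^R}, $1 ] })   #   add it to the current list
              | (?!)                    # otherwise fail and backtrack
            )
        )+
        $
        (?{ push @found, "@{$^R}" })    # remember this split
        (?!)                            # fail, to force the next split
    }x;
    return @found;
}

print "$_\n" for split_words("iamacat");
```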