Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all, I am very new to Perl and need help splitting a sentence on punctuation such as commas, spaces and ampersands. For example, given the sentence "earth, wind & fire", the final array should contain these elements (the words and punctuation marks after splitting): "earth", ",", "wind", "&", and "fire". If I split on spaces using the split function, I do not get the result stated above. Please help. Thanks.

Replies are listed 'Best First'.
Re: splitting punctuation in a text
by kyle (Abbot) on Aug 11, 2008 at 16:45 UTC

    If you have capturing parentheses in the regex in split, the stuff in there comes out in the resulting list.

    use Data::Dumper;

    my $s = 'earth, wind & fire';
    my @out = split /\s*([,&])\s*/, $s;
    print Dumper \@out;

    __END__
    $VAR1 = [
              'earth',
              ',',
              'wind',
              '&',
              'fire'
            ];

    Of course, make "[,&]" the class of punctuation you actually care about.
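To make the effect of the capturing parentheses concrete, here is a minimal sketch that runs the same split with and without them. The variable names @words_only and @with_punct are only illustrative.

```perl
use strict;
use warnings;

my $s = 'earth, wind & fire';

# Without capturing parentheses the separators are discarded.
my @words_only = split /\s*[,&]\s*/, $s;

# With capturing parentheses each separator comes back as its own element.
my @with_punct = split /\s*([,&])\s*/, $s;

print join('|', @words_only), "\n";   # earth|wind|fire
print join('|', @with_punct), "\n";   # earth|,|wind|&|fire
```

In both cases the \s* on either side of the class swallows the surrounding spaces, so they never show up in the result.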

      This might also be a good modification:

      my @out = split /\s*([[:punct:]])\s*/, $s;

      Depending on input and what the OP needs in the end. I think the POSIX classes, like punct, came in with 5.6. Someone will correct me if that's not right.
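As an illustration of the "depending on input" caveat, the sketch below feeds the [[:punct:]] version a string with several punctuation marks in a row. The input string and the "+" variation are assumptions made for this example, not part of the replies above.

```perl
use strict;
use warnings;

my $s = 'wait... what?';

# One punctuation character per separator: adjacent marks leave
# empty strings between them in the result.
my @single = split /\s*([[:punct:]])\s*/, $s;

# Hypothetical variation: "+" keeps a run of punctuation together.
my @runs = split /\s*([[:punct:]]+)\s*/, $s;

print scalar(@single), " elements: ", join('|', @single), "\n";
print scalar(@runs),   " elements: ", join('|', @runs),   "\n";
```

Whether "..." should be one token or three depends on what the result is used for, so neither form is necessarily the right one.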

Re: splitting punctuation in a text
by injunjoel (Priest) on Aug 11, 2008 at 16:59 UTC
    Greetings,
    I would suggest using a word boundary, \b. For instance:

    #!/usr/bin/perl -w
    use strict;
    use Data::Dumper;

    my $line = "earth, wind & fire";
    my @chunks = split /\b/, $line;
    print Dumper(\@chunks);

    Which produces what you want... well, close. If you don't want the spaces around your punctuation marks, this

    # grep drops the empty strings that split leaves behind;
    # testing length rather than truth keeps a token such as "0"
    my @chunks = grep { length } split /\b|\s/, $line;

    should do the trick.
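To show where "well, close" comes from, this sketch prints the chunks from both splits with brackets around each one so the embedded spaces are visible. It uses a length test in the grep, which is an assumption of this example so that a token such as "0" would survive.

```perl
use strict;
use warnings;

my $line = 'earth, wind & fire';

# Splitting on word boundaries alone leaves the spaces attached
# to the punctuation chunks.
my @raw = split /\b/, $line;

# Also splitting on whitespace, then dropping empty strings,
# leaves bare words and punctuation marks.
my @clean = grep { length } split /\b|\s/, $line;

print join('', map { "[$_]" } @raw),   "\n";   # [earth][, ][wind][ & ][fire]
print join('', map { "[$_]" } @clean), "\n";   # [earth][,][wind][&][fire]
```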


    Update! Upon re-reading your post I am a bit unclear... Do you want to keep the punctuation or not?
    If not, either of these will work (pick one):

    # either this
    my @no_punct = $line =~ /(\w+)/g;

    # or this (length rather than truth, so a token such as "0" is kept)
    my @no_punct = grep { length } split /\W|\s/, $line;

    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo
Re: splitting punctuation in a text
by swampyankee (Parson) on Aug 11, 2008 at 17:10 UTC

    Split takes a regex as its first argument, the string to be split as its second, and the maximum number of fields to be returned as the optional third. Write a capturing regex with the punctuation characters of interest, and, voila, there you have it.
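A short sketch of the three-argument form described above, assuming the same example sentence and a limit of 2. Captured separators are returned in addition to the fields and do not count toward the limit.

```perl
use strict;
use warnings;

my $s = 'earth, wind & fire';

# A limit of 2 stops after the first separator; the rest of the
# string comes back unsplit as the final field.
my @parts = split /\s*([,&])\s*/, $s, 2;

print join('|', @parts), "\n";   # earth|,|wind & fire
```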

    As a piece of de rigueur advice: read the article "How do I post a question effectively?".


    Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

Re: splitting punctuation in a text
by toolic (Bishop) on Aug 11, 2008 at 17:18 UTC
    Before I saw kyle's solution, my first thought was to inject whitespace around each punctuation mark, then use split on just whitespace:
    use strict;
    use warnings;
    use Data::Dumper;

    my $str = 'earth, wind & fire';
    $str =~ s/([,&])/ $1 /g;          # pad each punctuation mark with spaces
    my @arr = split /\s+/, $str;
    print Dumper(\@arr);

    __END__
    $VAR1 = [
              'earth',
              ',',
              'wind',
              '&',
              'fire'
            ];

    I definitely think kyle's is simpler (and better), but sometimes it's worth seeing a different approach.
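One thing to watch with the padding approach is input that begins with a punctuation mark: the padding puts whitespace at the front, and split /\s+/ then returns an empty first element. The sketch below uses a made-up input to show this, along with the special-case split ' ', which skips leading whitespace.

```perl
use strict;
use warnings;

my $str = '& fire';          # hypothetical input that starts with punctuation
$str =~ s/([,&])/ $1 /g;     # now ' &  fire'

my @with_pattern = split /\s+/, $str;   # leading empty field is kept
my @with_space   = split ' ',   $str;   # leading whitespace is skipped

print scalar(@with_pattern), "\n";   # 3
print scalar(@with_space),   "\n";   # 2
```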

Re: splitting punctuation in a text
by eff_i_g (Curate) on Aug 11, 2008 at 18:35 UTC
Re: splitting punctuation in a text
by Fletch (Bishop) on Aug 11, 2008 at 16:46 UTC

    So what have you tried? Show some code and you'll be more likely to get help (save from the benighted souls who'll go ahead and do your homework for you, but I digress . . .). See also How (Not) To Ask A Question.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: splitting punctuation in a text
by eosbuddy (Scribe) on Aug 11, 2008 at 18:13 UTC
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $s = 'earth, wind & fire';
    my @out = split /\b/, $s;
    print "$_\n" foreach (@out);

    and if you just needed the words without the punctuation marks:

    foreach (@out) {
        print "$_\n" if ($_ =~ /\w/);
    }