Wibble has asked for the wisdom of the Perl Monks concerning the following question:

Need some help from a split() guru:

The following code segment...
    print "$_\n" foreach (split(/\./, "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub"));
... produces the following list:
    Pugh
    Pugh
    Barney
    McGrew
    Cuthbert
    Dibble
    Grub
I would like to alter the syntax of my string to group certain words together whilst keeping the embedded '.' character, e.g. using quote characters to group the first two words: "'Pugh.Pugh'.Barney.McGrew.Cuthbert.Dibble.Grub", such that the list produced would be:
    Pugh.Pugh
    Barney
    McGrew
    Cuthbert
    Dibble
    Grub

I can't think how to construct the split regex to achieve this?
Any ideas?

Re: splitting headache
by Ido (Hermit) on Feb 26, 2002 at 13:47 UTC
    I'd use a regex like:
    print "$_\n" for "'Pugh.Pugh'.Barney.McGrew.Cuthbert.Dibble.Grub"=~/('.+?'|".+?"|[^\.]+).?/g
    Update: Or maybe
    /('(?:[^'\\]*|\\.)+'|"(?:[^"\\]*|\\.)+"|[^\.]+).?/g
      Thanks a lot for the answer -- this is perfect. The first regex works just fine for me and seems to handle nested "''" and '""' ok too, which is an advantage. You seem like the regex king; the 2nd regex looks very impressive -- erm what does it do?
        The second regex allows quotes inside a quoted group to be escaped with a backslash, like: 'Blah.Blah\'blah'. I'm not sure about it though...
Re: splitting headache
by IlyaM (Parson) on Feb 26, 2002 at 13:48 UTC
    I think it is impossible to use this format with split. Either simplify the format or use something else.

    This format looks like CSV with a dot as the field separator. You could just use Text::CSV_XS:

        my $data = "'Pugh.Pugh'.Barney.McGrew.Cuthbert.Dibble.Grub";
        use Text::CSV_XS;
        my $csv = Text::CSV_XS->new({ sep_char => '.', quote_char => "'" });
        $csv->parse($data) or die "Cannot parse data";
        my @fields = $csv->fields;

    --
    Ilya Martynov (http://martynov.org/)

      Thanks for the answer. This also works great and, although not as fast as the regex method, it is probably easier for regex Luddites like myself to get to grips with. Thanks again.
•Re: splitting headache
by merlyn (Sage) on Feb 26, 2002 at 14:59 UTC
    My rule of thumb is that whenever it is easier to talk about what you want to keep than what you want to throw away, use m//g instead of split. You want to keep the contents of a quoted string, or a non-dot string. So say it that way:
        $_ = "'Pugh.Pugh'.Barney.McGrew.Cuthbert.Dibble.Grub";
        my @keepers = grep defined $_, /'(.*?)'|([^.]+)/g;
        print map "<$_>\n", @keepers;
    The grep defined is in there because on every hit, we'll get $1 as the quoted string but $2 undef, or $2 as the non-dotted string but $1 undef, and all we have to do is toss the undefs to get the final result.
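
    As a rough sketch of the same m//g idea with a quoted group in the middle of the string rather than at the start: the helper name dotted_fields and the sample string are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the fields of a
# dot-separated string, treating '...' as a single field.
sub dotted_fields {
    my ($str) = @_;
    # Each match sets exactly one of the two captures; the other one
    # is undef, so grep drops it.
    return grep { defined } $str =~ /'(.*?)'|([^.]+)/g;
}

my @fields = dotted_fields("Barney.'Pugh.Pugh'.McGrew");
print "<$_>\n" for @fields;
```

    This sketch assumes quotes are never escaped or nested inside a quoted group.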

    -- Randal L. Schwartz, Perl hacker

Re: splitting headache
by strat (Canon) on Feb 26, 2002 at 13:40 UTC
    I think you would do better with pattern matching, e.g.
        my @result = $string =~ /^([^\.]+\.[^\.]+)(?:\.([^\.]+))*$/;
    or something like that.

    Or use split and some post-processing:

        my ($firstPart, @result) = split(/\./, $string);
        $result[0] = $firstPart . "." . $result[0];

    Best regards,
    perl -le "s==*F=e=>y~\*martinF~stronat~=>s~[^\w]~~g=>chop,print"

      Thanks for the reply. I should have explained that my "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub" string was just a typical string that might occur. The 'word.groupings' would actually come anywhere, not just at the start. Thanks again though.
Re: splitting headache
by PrakashK (Pilgrim) on Feb 26, 2002 at 18:14 UTC
    No need for an external module. Just use the Text::ParseWords module (part of the standard Perl distribution):
        use Text::ParseWords;
        my $line = "'Pugh.Pugh'.Barney.McGrew.Cuthbert.Dibble.Grub";
        my @words = quotewords('\.', 0, $line);
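
    The second argument to quotewords is the "keep" flag. A small sketch of the difference, using a made-up sample string with quoted groups in the middle (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Sample data invented for illustration: one single-quoted and one
# double-quoted group, neither at the start of the string.
my $line = q{Barney.'Pugh.Pugh'."Cuthbert.Dibble".Grub};

my @stripped = quotewords('\.', 0, $line);   # keep = 0: quotes removed
my @kept     = quotewords('\.', 1, $line);   # keep = 1: quotes retained

print "stripped: $_\n" for @stripped;
print "kept:     $_\n" for @kept;
```

    With keep set to 0 the quoted groups come back without their quote characters; with keep set to 1 the quotes stay in the returned fields.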
    /prakash
Re: splitting headache
by simon.proctor (Vicar) on Feb 26, 2002 at 13:59 UTC
    Well I played with it for a while and this is the closest I could get. Personally I would recommend that you split on the '.' and then iterate over the resulting list, keeping track of what you have seen already.
    print "$_\n" foreach (split(m/\G(([^\.]+)\.\2)|(?:(?!\2\.)\.)/g, "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub"));
    There's probably some stuff in there I don't need but I'm not a regex master by any means :)
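
    One way the "split first, then walk the list" idea might look when applied to the quoted syntax from the original question. This is only a sketch: the helper name split_then_regroup is hypothetical, and it assumes single quotes appear only at the start and end of a group.

```perl
use strict;
use warnings;

# Hypothetical helper: split on every dot, then glue back together
# the pieces that lie between an opening and a closing single quote.
sub split_then_regroup {
    my ($str) = @_;
    my (@fields, @pending);
    for my $piece (split /\./, $str, -1) {
        if (@pending) {
            # Inside a quoted group: collect pieces until one ends in '
            push @pending, $piece;
            if ($piece =~ /'\z/) {
                my $joined = join '.', @pending;
                $joined =~ s/\A'|'\z//g;      # drop the surrounding quotes
                push @fields, $joined;
                @pending = ();
            }
        }
        elsif ($piece =~ /\A'/ && $piece !~ /\A'.*'\z/s) {
            # Opening quote with no closing quote in the same piece
            @pending = ($piece);
        }
        else {
            # Plain word, or a quoted word without an embedded dot
            (my $field = $piece) =~ s/\A'(.*)'\z/$1/s;
            push @fields, $field;
        }
    }
    # Unterminated quote: keep whatever was collected, quote included
    push @fields, join('.', @pending) if @pending;
    return @fields;
}

my @fields = split_then_regroup("Barney.'Pugh.Pugh'.McGrew.'Cuthbert'");
print "$_\n" for @fields;
```

    It is more code than the m//g approach, but each step is plain list handling rather than one dense regex.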
      Thanks for the reply. I guess I should have explained that my "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub" string was just an example of a typical string that might occur. The 'word.groupings' could actually come anywhere, not just at the start, so this method won't work for me. Thanks again though.
Re: splitting headache
by Caillte (Friar) on Feb 26, 2002 at 14:01 UTC

    A very quick and dirty way of doing this would be:

        $line = "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub";
        $line =~ s/\./\n/g;
        $line =~ s/([^\n]*)\n/$1./;
        print $line;

    This page is intentionally left justified.

      Thanks for the reply. I guess I should have explained that my "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub" string was just an example of a typical string that might occur. The 'word.groupings' could actually come anywhere, not just at the start, so this method also won't work for me. Thanks again though.
Re: splitting headache
by strat (Canon) on Feb 26, 2002 at 16:00 UTC
    Some time ago, I wrote the following code:
        #!perl -w
        use strict;

        my $file = "anyfile.txt";
        my $sep  = ';';

        unless (open (CSV, $file)){
            die "Error: $!\n";
        }
        else {
            while (<CSV>){
                next if $. == 1;  # kill headline: dirty
                my @list = &ExtractFields($_, $sep);
                print join ":_:", @list;
            } # while
            close (CSV);
        } # else

        # ------------------------------------------------------------
        sub ExtractFields {
            my ($string, $sep) = @_;
            my @csv = &FilterIndexList($string, $sep);
            my $start = 0;
            my @list = ();
            foreach my $j (@csv){
                my $end = $j-1;
                # print "$start-$end ";
                push (@list, substr($string, $start, $end-$start+1));
                $start = $j+1;
            } # foreach

            # filter leading and trailing "
            foreach (@list){
                s/^\"(.*)\"$/$1/;
            }
            # print join("(_|_)", @list);
            return (@list);
        } # ExtractFields

        # ------------------------------------------------------------
        sub FilterIndexList {
            my ($string, $sep) = @_;
            my @sep = &GetIndexList($string, $sep);
            my @hc  = &GetIndexList($string, '"');

            # try to find connected " and remove
            # the positions within from @sep
            my $i = 0;
            foreach (;;){
                my ($start) = grep {$_ == $hc[$i]-1 } @sep;
                if ($start){
                    $i++;
                    my ($end) = grep {$_ == $hc[$i]+1 } @sep;
                    if ($end){
                        # print "found at $start-$end: $hc[$i]\n";
                        # kill positions in @sep within $start and $end
                        @sep = grep { $_ <= $start or $_ >= $end } @sep;
                        $i++;
                    }
                    else {
                        # invalid end; throw away end and start over again
                        splice(@hc, $i, 1);
                        $i--;
                    }
                }
                else {
                    # invalid begin; throw away start
                    splice(@hc, $i, 1);
                }
                last if $i > $#hc; # exit loop if no more " to test
            }
            return (@sep);
        } # FilterIndexList

        # ------------------------------------------------------------
        # Return a list of indices of positions of $sep in $string
        sub GetIndexList {
            my ($string, $subStr) = @_;
            my @list = ();
            my $pos = -1; # startposition
            while (1){
                # search for next $subStr
                $pos = index($string, $subStr, $pos+1);
                # if startposition again or not found, return
                last if $pos == -1;
                # else push found position onto list
                push (@list, $pos);
            }
            return (@list);
        } # GetIndexList
        # ------------------------------------------------------------
    It's not the best piece of code I've ever written, and I'm not sure if it works in all cases, but maybe it could help you... But it's a lot of code just for nearly nothing :-)

    Best regards,
    perl -le "s==*F=e=>y~\*martinF~stronat~=>s~[^\w]~~g=>chop,print"

Re: splitting headache
by Rich36 (Chaplain) on Feb 26, 2002 at 14:18 UTC

    To use split, you can reverse the string with the names, only split enough times (6) so that the last string contains the '.', reverse the order of the elements in the foreach loop, then (un)reverse the string when you print it.

        my @names = split(/\./, reverse("Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub"), 6);
        foreach (reverse(@names)) { print scalar(reverse($_)), "\n"; }
        __RESULT__
        Pugh.Pugh
        Barney
        McGrew
        Cuthbert
        Dibble
        Grub

    Rich36
    There's more than one way to screw it up...

      Thanks for the reply. I guess I should have explained that my "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub" string was just an example of a typical string that might occur. The 'word.groupings' could actually come anywhere, not just at the start, so this method won't work for me either. Thanks again though.
Re: splitting headache
by Gyro (Monk) on Feb 27, 2002 at 20:37 UTC
    Hey Wibble,
    In the following example I created another duplicate pair for show. The code treats the duplicates as if they might not be adjacent; in other words, I am assuming this could happen.
        my $string = (join(".", sort split(/\./, "Pugh.Barney.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub")));
        my @array = ($string =~ m/(\S+)\.\1/g); # Capture duplicate
        $string =~ s/(\S+)\.\1//g;              # Eliminate duplicates
        $string =~ s/\./\n/g;                   # Change \. to \n
        foreach (@array) {                      # Print duplicates
            print "\n$_.$_";
        }
        print "$string";                        # Print the rest of the list
    If you need to keep the string together you can use .= in the foreach loop and add ' around the dups as well.
    The following will add single quotes around the dups.
        $string =~ s/(\S+)\.\1/'$1.$1'/g; # Add quotes
    and plugged into merlyn's reply:
        $_ = join(".", sort split(/\./, "Pugh.Barney.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub"));
        $_ =~ s/(\S+)\.\1/'$1.$1'/g; # Add quotes
        my @keepers = grep defined $_, /'(.*?)'|([^.]+)/g;
        print map "$_\n", @keepers;
    Gyro