splitting headache

Wibble has asked for the wisdom of the Perl Monks concerning the following question:

Re: splitting headache by Ido (Hermit) on Feb 26, 2002 at 13:47 UTC
I'd use a regex like: `print "$_\n" for "'Pugh.Pugh'.Barney.McGrew.Cuthbert.Dibble.Grub"=~/('.+?'|".+?"|[^\.]+).?/g` Update: Or maybe `/('(?:[^'\\]|\\.)+'|"(?:[^"\\]|\\.)+"|[^\.]+).?/g`
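A minimal sketch of the first regex at work, assuming a quoted group can sit anywhere in the string. The `split_fields` helper and the sample input are made up for illustration, and the trailing `.?` is tightened to `\.?` here (an assumption, not part of the original post) so that it can only swallow a separator. Note that the surrounding quotes stay in the captured field.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the dot-separated
# fields of $line, treating '...' or "..." as a single field.
# The quotes themselves are kept in the returned field.
sub split_fields {
    my ($line) = @_;
    return $line =~ /('.+?'|".+?"|[^.]+)\.?/g;
}

# Made-up input with quoted groups in the middle of the string.
my @fields = split_fields(q{Barney.'Pugh.Pugh'.McGrew."Cuthbert.Dibble".Grub});
print "<$_>\n" for @fields;
```

If the quotes are not wanted in the result, they would have to be stripped in a second pass, for example with `s/^(['"])(.*)\1$/$2/` on each field.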
Re: Re: splitting headache by Wibble (Beadle) on Feb 26, 2002 at 14:30 UTC
Thanks a lot for the answer -- this is perfect. The first regex works just fine for me and seems to handle nested "''" and '""' ok too, which is an advantage. You seem like the regex king; the 2nd regex looks very impressive -- erm what does it do?
Re: Re: Re: splitting headache by Ido (Hermit) on Feb 26, 2002 at 15:14 UTC
The second regex is for quotes escaped with a backslash inside a quoted group, like: 'Blah.Blah\'blah'. I'm not sure about it though...
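A small sketch comparing the two patterns on an input with a backslash-escaped quote. The patterns are adapted from the reply above (with `\.?` instead of `.?` at the end, which is an assumption); the sub names and the sample string are invented for illustration.

```perl
use strict;
use warnings;

# First pattern: a quoted group ends at the next quote, escaped or not.
sub fields_simple {
    my ($line) = @_;
    return $line =~ /('.+?'|".+?"|[^.]+)\.?/g;
}

# Second pattern: inside quotes, a backslash plus any character counts
# as one unit, so \' does not close the group.
sub fields_escaped {
    my ($line) = @_;
    return $line =~ /('(?:[^'\\]|\\.)+'|"(?:[^"\\]|\\.)+"|[^.]+)\.?/g;
}

my $line = q{'Blah.Blah\'blah'.Barney};   # contains a backslash-escaped quote

print "simple:  <$_>\n" for fields_simple($line);
print "escaped: <$_>\n" for fields_escaped($line);
```

With this input the first pattern closes the quoted group at the escaped quote and returns three pieces, while the second keeps `'Blah.Blah\'blah'` together as one field.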
Re: splitting headache by IlyaM (Parson) on Feb 26, 2002 at 13:48 UTC
I think it is impossible to use this format with split. Either simplify the format or use something else. This format looks like CSV with a dot as the field separator. You could just use Text::CSV_XS: `my $data = "'Pugh.Pugh'.Barney.McGrew.Cuthbert.Dibble.Grub"; use Text::CSV_XS; my $csv = Text::CSV_XS->new({ sep_char => '.', quote_char => "'" }); $csv->parse($data) or die "Cannot parse data"; my @fields = $csv->fields;` -- Ilya Martynov (http://martynov.org/)
Re: Re: splitting headache by Wibble (Beadle) on Feb 26, 2002 at 14:33 UTC
Thanks for the answer. This also works great and, although not as fast as the regex method, it is probably easier for regex Luddites like myself to get to grips with. Thanks again.
•Re: splitting headache by merlyn (Sage) on Feb 26, 2002 at 14:59 UTC
My rule of thumb is that whenever it is easier to talk about what you want to keep than what you want to throw away, use `m//g` instead of `split`. You want to keep the contents of a quoted string, or a non-dot string. So say it that way: `$_ = "'Pugh.Pugh'.Barney.McGrew.Cuthbert.Dibble.Grub"; my @keepers = grep defined $_, /'(.*?)'|([^.]+)/g; print map "<$_>\n", @keepers;` The `grep defined` is in there because on every hit, we'll get $1 as the quoted string but $2 undef, or $2 as the non-dotted string but $1 undef, and all we have to do is toss the undefs to get the final result. -- Randal L. Schwartz, Perl hacker
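To make the rule of thumb concrete, here is a small sketch that runs a plain `split` and the `m//g` pattern from the post on the same input. The `keep_fields` wrapper and the sample string (with the quoted group in the middle) are assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the m//g pattern: describe what to keep
# (a quoted string's contents, or a run of non-dots) and drop the undefs.
sub keep_fields {
    my ($line) = @_;
    return grep { defined } $line =~ /'(.*?)'|([^.]+)/g;
}

my $line = "Barney.'Pugh.Pugh'.McGrew";   # made-up input

my @by_split = split /\./, $line;     # describes what to throw away
my @by_match = keep_fields($line);    # describes what to keep

print "split: <$_>\n" for @by_split;
print "m//g:  <$_>\n" for @by_match;
```

The `split` version cuts through the quoted group because it only knows about the separator; the `m//g` version keeps the group whole and drops the quotes.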
Re: splitting headache by strat (Canon) on Feb 26, 2002 at 13:40 UTC
I think you might do better with a pattern match, e.g. `my @result = $string =~ /^([^\.]*\.[^\.]*)(?:\.([^\.]*))*$/;` or something like that. Or use split and some post-processing: `my ($firstPart, @result) = split(/\./, $string); $result[0] = $firstPart . "." . $result[0];` Best regards, perl -le "s==*F=e=>y~\*martinF~stronat~=>s~[^\w]~~g=>chop,print"
Re: Re: splitting headache by Wibble (Beadle) on Feb 26, 2002 at 14:45 UTC
Thanks for the reply. I should have explained that my "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub" string was just a typical string that might occur. The 'word.groupings' would actually come anywhere, not just at the start. Thanks again though.
Re: splitting headache by PrakashK (Pilgrim) on Feb 26, 2002 at 18:14 UTC
No need for an external module. Just use the Text::ParseWords module (part of the standard Perl distribution): `use Text::ParseWords; my $line = "'Pugh.Pugh'.Barney.McGrew.Cuthbert.Dibble.Grub"; my @words = quotewords('\.', 0, $line);` /prakash
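A short sketch of what the second argument to `quotewords` changes, assuming the quoted group may appear anywhere in the line. The `dot_fields` wrapper and the sample input are made up for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical wrapper: split $line on dots, honouring quotes.
# $keep is passed straight through to quotewords: false strips the
# quotes from each field, true leaves them in place.
sub dot_fields {
    my ($line, $keep) = @_;
    return quotewords('\.', $keep, $line);
}

my $line = "Barney.'Pugh.Pugh'.McGrew";   # made-up input

print "stripped: <$_>\n" for dot_fields($line, 0);
print "kept:     <$_>\n" for dot_fields($line, 1);
```

The first argument is a regex, which is why the dot is escaped.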
Re: splitting headache by simon.proctor (Vicar) on Feb 26, 2002 at 13:59 UTC
Well I played with it for a while and this is the closest I could get. Personally I would recommend that you split on the '.' and then iterate over the resulting list, keeping track of what you have seen already. `print "$_\n" foreach (split(m/\G(([^\.]+)\.\2)|(?:(?!\2\.)\.)/g, "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub"));` There's probably some stuff in there I don't need but I'm not a regex master by any means :)
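The split-then-iterate suggestion could be sketched roughly as follows. This assumes that a "grouping" means two identical words next to each other, which is only a guess at the intended rule; the sub name and the sample input are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split on dots, then walk the list and glue a word
# onto the previous one when the two are identical (assumed meaning of a
# "grouping"). A merged pair does not absorb a third copy.
sub merge_adjacent_duplicates {
    my ($line) = @_;
    my @out;
    my $previous;    # last word pushed that has not been merged yet
    for my $word (split /\./, $line) {
        if (defined $previous && $word eq $previous) {
            $out[-1] = "$previous.$word";
            undef $previous;
        }
        else {
            push @out, $word;
            $previous = $word;
        }
    }
    return @out;
}

print "$_\n" for merge_adjacent_duplicates("Barney.Pugh.Pugh.McGrew.Cuthbert");
```

Because it only compares neighbours, this works wherever the pair sits in the string, but it cannot recognise groupings whose two halves differ.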
Re: Re: splitting headache by Wibble (Beadle) on Feb 26, 2002 at 14:36 UTC
Thanks for the reply. I guess I should have explained that my "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub" string was just an example of a typical string that might occur. The 'word.groupings' could actually come anywhere, not just at the start, so this method won't work for me. Thanks again though.
Re: splitting headache by Caillte (Friar) on Feb 26, 2002 at 14:01 UTC
A very quick and dirty way of doing this would be: `$line = "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub"; $line =~ s/\./\n/g; $line =~ s/([^\n])\n/$1./; print $line;` This page is intentionally left justified.*
Re: Re: splitting headache by Wibble (Beadle) on Feb 26, 2002 at 14:41 UTC
Thanks for the reply. I guess I should have explained that my "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub" string was just an example of a typical string that might occur. The 'word.groupings' could actually come anywhere, not just at the start, so this method also won't work for me. Thanks again though.
Re: splitting headache by strat (Canon) on Feb 26, 2002 at 16:00 UTC
Some time ago, I wrote the following code:

    #!perl -w
    use strict;

    my $file = "anyfile.txt";
    my $sep = ';';

    unless (open (CSV, $file)){
        die "Error: $!\n";
    }
    else {
        while (<CSV>){
            next if $. == 1; # kill headline: dirty
            my @list = &ExtractFields($_, $sep);
            print join ":_:", @list;
        } # while
        close (CSV);
    } # else

    # ------------------------------------------------------------
    sub ExtractFields {
        my ($string, $sep) = @_;

        my @csv = &FilterIndexList($string, $sep);

        my $start = 0;
        my @list = ();
        foreach my $j (@csv){
            my $end = $j-1;
            # print "$start-$end ";
            push (@list, substr($string, $start, $end-$start+1));
            $start = $j+1;
        } # foreach

        # filter leading and trailing "
        foreach (@list){
            s/^\"(.*)\"$/$1/;
        }
        # print join("(_|_)", @list);
        return (@list);
    } # ExtractFields

    # ------------------------------------------------------------
    sub FilterIndexList {
        my ($string, $sep) = @_;

        my @sep = &GetIndexList($string, $sep);
        my @hc = &GetIndexList($string, '"');

        # try to find connected " and remove
        # the positions within from @sep
        my $i = 0;
        foreach (;;){
            my ($start) = grep {$_ == $hc[$i]-1 } @sep;
            if ($start){
                $i++;
                my ($end) = grep {$_ == $hc[$i]+1 } @sep;
                if ($end){
                    # print "found at $start-$end: $hc[$i]\n";
                    # kill positions in @sep within $start and $end
                    @sep = grep { $_ <= $start or $_ >= $end } @sep;
                    $i++;
                }
                else { # invalid end; throw away end and start over again
                    splice(@hc, $i, 1);
                    $i--;
                }
            }
            else { # invalid begin; throw away start
                splice(@hc, $i, 1);
            }
            last if $i > $#hc; # exit loop if no more " to test
        }
        return (@sep);
    } # FilterIndexList

    # ------------------------------------------------------------
    # Return a list of indices of positions of $sep in $string
    sub GetIndexList {
        my ($string, $subStr) = @_;

        my @list = ();
        my $pos = -1; # startposition
        while (1){
            # search for next $subStr
            $pos = index($string, $subStr, $pos+1);
            # if startposition again or not found, return
            last if $pos == -1;
            # else push found position onto list
            push (@list, $pos);
        }
        return (@list);
    } # GetIndexList
    # ------------------------------------------------------------

It's not the best piece of code I've ever written, and I'm not sure if it works in all cases, but maybe it could help you... But it's a lot of code just for nearly nothing :-) Best regards, perl -le "s==*F=e=>y~\*martinF~stronat~=>s~[^\w]~~g=>chop,print"
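For comparison, a much shorter sketch that aims at the same result. This is a swapped-in technique (a single character scan with an in-quotes flag), not the index-list code above. The name `split_outside_quotes` is hypothetical, and the sketch assumes a single-character separator, a single-character quote, and no escaped quotes inside fields.

```perl
use strict;
use warnings;

# Hypothetical helper: split $string on $sep, except where $sep appears
# between a pair of $quote characters. The quote characters are dropped.
sub split_outside_quotes {
    my ($string, $sep, $quote) = @_;
    my @fields = ('');
    my $in_quotes = 0;
    for my $char (split //, $string) {
        if ($char eq $quote) {
            $in_quotes = !$in_quotes;    # toggle state, drop the quote itself
        }
        elsif ($char eq $sep && !$in_quotes) {
            push @fields, '';            # unquoted separator starts a new field
        }
        else {
            $fields[-1] .= $char;        # ordinary character, or quoted separator
        }
    }
    return @fields;
}

print join(":_:", split_outside_quotes('a;"b;c";d', ';', '"')), "\n";
```

Unlike the index-based version, this makes one pass over the string and keeps the trailing field after the last separator.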
Re: splitting headache by Rich36 (Chaplain) on Feb 26, 2002 at 14:18 UTC
To use split, you can reverse the string with the names, split only enough times (6) so that the last string contains the '.', reverse the order of the elements in the foreach loop, then (un)reverse each string when you print it. `my @names = split(/\./, reverse("Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub"), 6); foreach(reverse(@names)) {print reverse . "\n";}` __RESULT__ `Pugh.Pugh Barney McGrew Cuthbert Dibble Grub` Rich36 There's more than one way to screw it up...
Re: Re: splitting headache by Wibble (Beadle) on Feb 26, 2002 at 14:43 UTC
Thanks for the reply. I guess I should have explained that my "Pugh.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub" string was just an example of a typical string that might occur. The 'word.groupings' could actually come anywhere, not just at the start, so this method won't work for me either. Thanks again though.
Re: splitting headache by Gyro (Monk) on Feb 27, 2002 at 20:37 UTC
Hey Wibble, in the following example I created another duplicate pair for show. The code treats the input as if the duplicates are not adjacent; in other words, I am assuming this could happen.

    my $string = (join(".", sort split(/\./, "Pugh.Barney.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub")));
    my @array = ($string =~ m/(\S+)\.\1/g); # Capture duplicate
    $string =~ s/(\S+)\.\1//g; # Eliminate duplicates
    $string =~ s/\./\n/g; # Change \. to \n
    foreach (@array) { # Print duplicates
        print "\n$_.$_";
    }
    print "$string"; # Print the rest of the list

If you need to keep the string together you can use .= in the foreach loop and add ' around the dups as well. The following will add single quotes around the dups.

    $string =~ s/(\S+)\.\1/'$1.$1'/g; # Add quotes

and plugged into merlyn's reply

    $_ = join(".", sort split(/\./, "Pugh.Barney.Pugh.Barney.McGrew.Cuthbert.Dibble.Grub"));
    $_ =~ s/(\S+)\.\1/'$1.$1'/g; # Add quotes
    my @keepers = grep defined $_, /'(.*?)'|([^.]+)/g;
    print map "$_\n", @keepers;

Gyro