split with regex

Emanuel has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I'm running into a problem using split with a regex. Forgive me for the following bad regex; it's just something quick and dirty:

@lines = split(/(\d\d:\d\d:\d\d[a-zA-Z].*?[A-Z][A-Z][A-Z][A-Z]+)/,$data);

this is working fine for $data like

00:01:00Something here bla bla blaTYPE00:02:00Something here bla bla blaANOTHERTYPE00:03:00Something here bla bla blaEVENMORETYPES

however, it (obviously) doesn't work for something like

00:01:00Something here bla bla blaType00:02:00Something here bla bla blaTypetoooo00:03:00Something here bla bla blaTYPETHREE

I'm trying to build up a regex that lets split split at the digits of the next 'entry'. The outcome I need is

0 => 00:01:00Something here bla bla blaType
1 => 00:02:00Something here bla bla blaTypetoooo
2 => 00:03:00Something here bla bla blaTYPETHREE

Any help on this is highly appreciated.

regards
Emanuel


Replies are listed 'Best First'.
Re: split with regex by rasta (Hermit) on Nov 01, 2002 at 14:44 UTC
I believe this should be helpful: `split /(?=\d\d:\d\d:\d\d)/, $data;`
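A minimal sketch of how the lookahead split behaves on the mixed-case sample from the question. The variable name `@entries` is only illustrative. `(?=...)` is zero-width, so split cuts in front of each timestamp without consuming it, and a zero-width match at the very start of the string does not produce an empty leading field.

```perl
use strict;
use warnings;

# Sample input from the question: three entries run together.
my $data = '00:01:00Something here bla bla blaType'
         . '00:02:00Something here bla bla blaTypetoooo'
         . '00:03:00Something here bla bla blaTYPETHREE';

# Split *before* every hh:mm:ss timestamp; the lookahead consumes nothing,
# so the digits stay attached to the entry they introduce.
my @entries = split /(?=\d\d:\d\d:\d\d)/, $data;

printf "%d => %s\n", $_, $entries[$_] for 0 .. $#entries;
```

Note that this also splits wherever something that looks like a timestamp appears inside the text of an entry, so it relies on the text never containing that pattern.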
Re: Re: split with regex by Bird (Pilgrim) on Nov 01, 2002 at 15:58 UTC
rasta provides an elegant solution to your problem, if you need to use split. I tend to avoid fancy things like lookahead assertions if possible, if only to simplify maintenance. Also, since I was curious, I thought I'd benchmark our two solutions.

    my $data = "00:01:00Something here bla bla blaTYPE00:02:00".
               "Something here bla bla blaANOTHERTYPE00:03:00S".
               "omething here bla bla blaEVENMORETYPES";

    use Benchmark;

    timethese (100000, {
        withsplit => sub { my @lines = split /(?=\d\d:\d\d:\d\d)/, $data; },
        nosplit   => sub { my @lines = $data =~ /(\d{2}:\d{2}:\d{2}[^\d]*)/g; }
    } );

...gives me...

    Benchmark: timing 100000 iterations of nosplit, withsplit...
       nosplit:  6 wallclock secs ( 7.09 usr +  0.00 sys =  7.09 CPU) @ 14104.37/s (n=100000)
     withsplit: 13 wallclock secs (13.85 usr +  0.00 sys = 13.85 CPU) @ 7220.22/s (n=100000)

I don't know if it's split or the lookahead that's slowing things down, but I thought you might be interested in my results anyway.

-Bird
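One rough way to probe whether split or the lookahead is responsible is to add a third variant that uses a global match together with a lookahead. The sketch below is only that: the `matchahead` variant and its regex are not from the thread, the iteration count is arbitrary, and the timings will differ by machine and Perl version, so no numbers are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $data = "00:01:00Something here bla bla blaTYPE"
         . "00:02:00Something here bla bla blaANOTHERTYPE"
         . "00:03:00Something here bla bla blaEVENMORETYPES";

# Each variant returns its list of entries so the results can be compared.
my %variants = (
    # split with a lookahead (rasta's suggestion)
    withsplit  => sub { my @l = split /(?=\d\d:\d\d:\d\d)/, $data; @l },
    # global match without a lookahead (Bird's suggestion)
    nosplit    => sub { my @l = $data =~ /(\d{2}:\d{2}:\d{2}[^\d]*)/g; @l },
    # hypothetical extra variant: global match *with* a lookahead
    matchahead => sub {
        my @l = $data =~ /(\d{2}:\d{2}:\d{2}.*?)(?=\d{2}:\d{2}:\d{2}|\z)/gs;
        @l;
    },
);

# Iteration count is arbitrary; raise it for steadier numbers.
cmpthese(20_000, \%variants);
```

Because the third regex also swaps `[^\d]*` for a lazy `.*?`, it does not isolate the lookahead cleanly; treat any difference as a hint rather than a measurement of one feature.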
Re: split with regex by Bird (Pilgrim) on Nov 01, 2002 at 15:05 UTC
If you're trying to keep all the data you're matching, I don't know if split is the best solution. You should be fine just using a global match with capturing. Something like...

    my $data = "00:01:00Something here bla bla blaTYPE00:02:00".
               "Something here bla bla blaANOTHERTYPE00:03:00S".
               "omething here bla bla blaEVENMORETYPES";

    my $moredata = "00:01:00Something here bla bla blaType00:0".
                   "2:00Something here bla bla blaTypetoooo00:".
                   "03:00Something here bla bla blaTYPETHREE";

    my @lines     = $data     =~ /(\d{2}:\d{2}:\d{2}[^\d]*)/g;
    my @morelines = $moredata =~ /(\d{2}:\d{2}:\d{2}[^\d]*)/g;

    print "$_\n" for @lines;
    print "\n";
    print "$_\n" for @morelines;

This assumes that the text between the digits won't contain any other digits. You may need to modify the `[^\d]*` portion of the regex if any digits may appear in the text section. From your examples, though, this appears to do what you need.

-Bird
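To illustrate the caveat about digits in the text, here is a small sketch with a made-up input (`$with_digits` is not from the thread) that contains a stray digit. It compares the `[^\d]*` pattern with one possible modification: a lazy `.*?` that stops at the next timestamp or at the end of the string. The modified regex is an assumption about what such a change could look like, not something Bird posted.

```perl
use strict;
use warnings;

# Hypothetical input: the first entry's text contains a digit.
my $with_digits = '00:01:00Text 5 items00:02:00More';

# Pattern from the reply: the text part stops at the first digit it meets.
my @naive = $with_digits =~ /(\d{2}:\d{2}:\d{2}[^\d]*)/g;

# Possible modification: take as little as possible (.*?) up to the next
# timestamp or the end of the string; /s lets . match newlines too.
my @lazy = $with_digits =~ /(\d{2}:\d{2}:\d{2}.*?)(?=\d{2}:\d{2}:\d{2}|\z)/gs;

print "naive: [$_]\n" for @naive;   # first entry is cut off before the 5
print "lazy:  [$_]\n" for @lazy;    # both entries are complete
```

The lazy version still assumes that nothing shaped like hh:mm:ss appears inside an entry's text.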
Re: split with regex by Jaap (Curate) on Nov 01, 2002 at 14:06 UTC
How about this: `split (/(\d{2}\:\d{2}\:\d{2})/, $data);`
Re: Re: split with regex by Emanuel (Pilgrim) on Nov 01, 2002 at 14:13 UTC
The problem here is that it rips off the leading digits, but I do need them. That's my dilemma: I can't seem to find a proper regex for this.
Re: Re: Re: split with regex by Jaap (Curate) on Nov 01, 2002 at 15:20 UTC
No, it does not, because the regex is surrounded by (). The time is stored in a separate array element.
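A short sketch of what the capturing split returns, and one way to glue each time back onto its text. The shortened input and the names `@parts` and `@entries` are illustrative only. Since the first timestamp is a positive-width match at the start of the string, split also produces an empty leading field, which has to be dropped before pairing.

```perl
use strict;
use warnings;

# Shortened, made-up input in the same shape as the question's data.
my $data = '00:01:00aaaType00:02:00bbbTypetoooo';

# With capturing parentheses, split returns the separators as extra elements:
# ('', '00:01:00', 'aaaType', '00:02:00', 'bbbTypetoooo')
my @parts = split /(\d{2}:\d{2}:\d{2})/, $data;

# Drop the empty field that precedes the first timestamp.
shift @parts if @parts && $parts[0] eq '';

# Re-join each timestamp with the text that follows it.
my @entries;
while (my ($time, $text) = splice @parts, 0, 2) {
    push @entries, $time . (defined $text ? $text : '');
}

print "$_\n" for @entries;
```

So the digits are preserved, but as separate elements; getting one element per entry takes this extra pairing step, which the lookahead split avoids.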