Splitting a string based on a regex

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to solve a problem with some code I've inherited. The code uses Regexp::Assemble to create a compiled regex of some 400 words. The problem is that I need to be able to split the input record based on the matches found by the regex. Due to various issues (which I won't go into) I cannot change the way that the regex is being built via Regexp::Assemble.
Here is a sample of the input records:

insert newtab values(1) drop table newtab
create table XXXX (field1 int null) insert XXXX values(1) grant select on XXXX to sa drop table XXXX
create table XXXX (field1 int null) insert XXXX values(1) grant select on XXXX to sa drop table XXXX rollback tran
create table XXXX (field1 int null) sp_help XXXX 
insert XXXX values(1) grant select on XXXX to sa drop table XXXX rollback tran
lock table cDsnJbgnd..smfJwDlwb in share mode

For this example the regex is built from this file

create table ([a-z]|[_])+
insert \b([a-z]|[_])+\b 
lock table
grant

This is what the regex produces

(?-xism:(?ig:(?:create table ([a-z]|[_])+|insert \b([a-z]|[_])+\b |lock table|grant)))

Each record ($rec) is read in from a database via DBI and interrogated like this

# Build regex list
open KEYS,"patterns" or die "Can't open the pattern file: $!";
my $exp = Regexp::Assemble ->new(flags => 'ig',chomp => 1) ->add( <KEYS> );
close KEYS;

$exp->track( 1 );

while (get the database records) {

    if ($exp->match($rec) ) {
        # populate a csv file for a spreadsheet
    }
}

The problem is that some of the input records are multi-statement commands (i.e. they contain more than one matched pattern) and I need to be able to split the record up into its constituent commands (where the split would be on a matched regex pattern) and process each bit separately before printing to the csv file.
E.g. this record

create table XXXX (field1 int null) insert XXXX values(1)

Needs to be processed as

create table XXXX (field1 int null)
insert XXXX values(1)

Can somebody explain to me how to do this, assuming I have to use Regexp::Assemble to do the pattern match (and the SQL to extract the records cannot be changed either!)?
Ta


Re: Splitting a string based on a regex by duff (Parson) on Feb 06, 2006 at 13:54 UTC
Um, if you have a regular expression and want to split a string based on that regular expression, that's what split does. Am I missing something?

Update: here's some code (untested):

# assuming you've got the RE in $re
...
my @parts = split /(?=$re)/, $string;

I used a zero-width look-ahead so that the text is split just before the part that matches, as that seems to be your desire.

duff
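For readers without Regexp::Assemble installed, the look-ahead split suggested above can be tried on its own. This is only a minimal sketch: `$re` is a hand-written, hypothetical stand-in for the assembled regex (written with non-capturing groups), and `split_statements` is a helper name invented for the illustration, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the assembled regex (not produced by
# Regexp::Assemble); it uses no capturing groups.
my $re = qr/(?i:create table [a-z_]+|insert \b[a-z_]+\b |lock table|grant)/;

# Split a record just before each place the pattern matches.
sub split_statements {
    my ($text) = @_;
    # Zero-width look-ahead: nothing is consumed, so the matched
    # keyword stays at the start of the following part.
    my @parts = split /(?=$re)/, $text;
    s/\s+\z// for @parts;    # drop the whitespace left before each split point
    return @parts;
}

my @parts = split_statements(
    'create table XXXX (field1 int null) insert XXXX values(1)'
);
print "[$_]\n" for @parts;
```

Because the look-ahead matches without consuming any text, the space that preceded `insert` ends up at the end of the first part, which is why the sketch trims trailing whitespace.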
Re^2: Splitting a string based on a regex by Anonymous Monk on Feb 06, 2006 at 14:39 UTC
Hmm, thanks for the reply, but I'm not sure I understand how this will work. The regular expression held in $exp looks like this

(?-xism:(?ig:(?:create table ([a-z]|[_])+(?{$self->{m}=1})|insert \b([a-z]|[_])+\b (?{$self->{m}=3})|lock table(?{$self->{m}=0})|grant(?{$self->{m}=2}))))

If I understand your answer correctly (it's highly plausible that I don't!), you are suggesting that I split the line based on this regex? So that makes the suggested code look like this?

my @parts = split /(?=(?-xism:(?ig:(?:create table ([a-z]|[_])+(?{$self->{m}=1})|insert \b([a-z]|[_])+\b (?{$self->{m}=3})|lock table(?{$self->{m}=0})|grant(?{$self->{m}=2})))))/, $text;

I'm only just past novice stage with Perl, but that doesn't look right to me.
Re^3: Splitting a string based on a regex by duff (Parson) on Feb 06, 2006 at 15:37 UTC
Well, that would work but I was thinking that you'd get the RE from Regexp::Assemble in a more direct way. Here's a complete, tested example:

#!/usr/bin/perl

use strict;
use warnings;
use Regexp::Assemble;

open KEYS,"patterns" or die "Can't open the pattern file: $!";
my $exp = Regexp::Assemble ->new(flags => 'i',chomp => 1) ->add( <KEYS> );
close KEYS;

my $re = $exp->re;                        # store the assembled RE for later

while (my $text = <DATA>) {
    chomp($text);
    my @parts = split /(?=$re)/, $text;   # split based on stored RE
    for my $part (@parts) {
        print "line $.: $part\n";
    }
}
exit;

__DATA__
insert newtab values(1) drop table newtab
create table XXXX (field1 int null) insert XXXX values(1) grant select on XXXX to sa drop table XXXX
create table XXXX (field1 int null) insert XXXX values(1) grant select on XXXX to sa drop table XXXX rollback tran
create table XXXX (field1 int null) sp_help XXXX
insert XXXX values(1) grant select on XXXX to sa drop table XXXX rollback tran
lock table cDsnJbgnd..smfJwDlwb in share mode

I didn't write this earlier because I hadn't installed Regexp::Assemble or read its docs yet. Also, I removed the `g` flag that you passed to `Regexp::Assemble->new` as perl told me that it was a useless use.

duff
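One caveat may be worth checking here, offered as an assumption about the setup rather than something confirmed in the thread. `split` returns an extra field for every capturing group in its pattern, and the patterns file contains capturing parentheses (`([a-z]|[_])+`). If those groups survive in `$exp->re` (not verified here), the list returned by `split` would contain captured fragments and undefined entries alongside the statements. The sketch below reproduces the effect with a hand-written stand-in pattern and shows one possible workaround that leaves the pattern untouched: record where each match starts using `@-`, then cut the string at those offsets. The helper name `split_at_matches` is invented for the illustration.

```perl
use strict;
use warnings;

# Hand-written stand-in that keeps the capturing groups from the patterns
# file (an assumption about what the assembled regex contains).
my $re = qr/(?i:create table ([a-z]|[_])+|insert \b([a-z]|[_])+\b |lock table|grant)/;

my $text = 'create table XXXX (field1 int null) insert XXXX values(1)';

# Naive split: every capturing group adds one field per split point,
# so this list holds more than the two statements.
my @naive = split /(?=$re)/, $text;
print "naive split returned ", scalar(@naive), " fields\n";

# Alternative: collect the start offset of each match, then cut there.
sub split_at_matches {
    my ($re, $text) = @_;
    my @starts;
    while ($text =~ /$re/g) {
        push @starts, $-[0];                 # offset where this match began
    }
    return ($text) unless @starts;           # no match: record is one part
    unshift @starts, 0 if $starts[0] != 0;   # keep unmatched leading text
    my @parts;
    for my $i (0 .. $#starts) {
        my $end = $i < $#starts ? $starts[$i + 1] : length $text;
        push @parts, substr($text, $starts[$i], $end - $starts[$i]);
    }
    s/\s+\z// for @parts;                    # trim trailing whitespace
    return @parts;
}

print "[$_]\n" for split_at_matches($re, $text);
```

Unlike the look-ahead split, the `//g` loop consumes each match before searching for the next one, so a keyword that occurs inside an earlier match does not start a new part. Whether that difference matters depends on the real pattern list.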
Re^4: Splitting a string based on a regex by Anonymous Monk on Feb 07, 2006 at 09:00 UTC
Re^3: Splitting a string based on a regex by Anonymous Monk on Feb 06, 2006 at 15:33 UTC
As I suspected, it's my lack of knowledge that's the problem. I figured it out.