Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to solve a problem with some code I've inherited. The code uses Regexp::Assemble to create a compiled regex of some 400 words. The problem is that I need to be able to split the input record based on the matches found by the regex. Due to various issues (which I won't go into) I cannot change the way that the regex is being built via Regexp::Assemble.
Here is a sample of the input records
insert newtab values(1) drop table newtab create table XXXX (field1 int null) insert XXXX values(1) grant select on XXXX to sa drop table XXXX create table XXXX (field1 int null) insert XXXX values(1) grant select on XXXX to sa drop table XXXX rollback tran create table XXXX (field1 int null) sp_help XXXX insert XXXX values(1) grant select on XXXX to sa drop table XXXX rollback tran lock table cDsnJbgnd..smfJwDlwb in share mode
For this example the regex is built from this file
create table ([a-z]|[_])+
insert \b([a-z]|[_])+\b 
lock table
grant
This is what the regex produces
(?-xism:(?ig:(?:create table ([a-z]|[_])+|insert \b([a-z]|[_])+\b |lock table|grant)))
Each record ($rec) is read in from a database via DBI::DBD and interrogated like this
# Build regex list
open KEYS, "patterns" or die "Can't open the pattern file: $!";
my $exp = Regexp::Assemble
    ->new(flags => 'ig', chomp => 1)
    ->add( <KEYS> );
close KEYS;
$exp->track( 1 );

while (get the database records) {
    if ($exp->match($rec) ) {
        # populate a csv file for a spreadsheet
    }
}
The problem is that some of the input records are multi-statement commands (i.e. they contain more than one matched pattern) and I need to be able to split the record up into its constituent commands (where the split would be on a matched regex pattern) and process each bit separately before printing to the csv file.
e.g. this record
create table XXXX (field1 int null) insert XXXX values(1)
Needs to be processed as
create table XXXX (field1 int null)
insert XXXX values(1)
Can somebody explain to me how to do this, assuming I have to use Regexp::Assemble to do the pattern match (and the SQL to extract the records cannot be changed either!)?
Ta

Replies are listed 'Best First'.
Re: Splitting a string based on a regex
by duff (Parson) on Feb 06, 2006 at 13:54 UTC

    Um, if you have a regular expression and want to split a string based on that regular expression, that's what split does. Am I missing something?

    update: here's some code (untested):

    # assuming you've got the RE in $re ...
    my @parts = split /(?=$re)/, $string;

    I used a zero-width look-ahead so that the text is split just before the part that matches as that seems to be your desire.
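
    To make the look-ahead behaviour concrete, here is a minimal sketch. The $re below is a hand-written stand-in with non-capturing groups, not the pattern that Regexp::Assemble actually produces, and the sample string is the one from the question.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the assembled pattern: written by hand for this
# illustration, NOT generated by Regexp::Assemble.
my $re = qr/(?i:create table [a-z_]+|insert \b[a-z_]+\b |lock table|grant)/;

my $string = 'create table XXXX (field1 int null) insert XXXX values(1)';

# The look-ahead matches the empty string just before each keyword, so
# nothing is consumed and the keyword stays with the piece that follows it.
my @parts = split /(?=$re)/, $string;

print "[$_]\n" for @parts;
```

    The first piece keeps the space that preceded "insert", so the pieces may need trimming before they are processed further.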

      Hmm thanks for the reply but I'm not sure I understand how this will work.
      The regular expression held in $exp looks like this
      (?-xism:(?ig:(?:create table ([a-z]|[_])+(?{$self->{m}=1})|insert \b([a-z]|[_])+\b (?{$self->{m}=3})|lock table(?{$self->{m}=0})|grant(?{$self->{m}=2}))))
      If I understand your answer correctly (it's highly plausible that I don't!), you are suggesting that I split the line based on this regex?
      So that makes the suggested code look like this ?
      my @parts = split /(?=(?-xism:(?ig:(?:create table ([a-z]|[_])+(?{$self->{m}=1})|insert \b([a-z]|[_])+\b (?{$self->{m}=3})|lock table(?{$self->{m}=0})|grant(?{$self->{m}=2})))))/, $text;
      I'm only just past novice stage with Perl, but that doesn't look right to me.

        Well, that would work but I was thinking that you'd get the RE from Regexp::Assemble in a more direct way. Here's a complete, tested example:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Regexp::Assemble;

        open KEYS, "patterns" or die "Can't open the pattern file: $!";
        my $exp = Regexp::Assemble
            ->new(flags => 'i', chomp => 1)
            ->add( <KEYS> );
        close KEYS;
        my $re = $exp->re;    # store the assembled RE for later

        while (my $text = <DATA>) {
            chomp($text);
            my @parts = split /(?=$re)/, $text;    # split based on stored RE
            for my $part (@parts) {
                print "line $.: $part\n";
            }
        }
        exit;
        __DATA__
        insert newtab values(1) drop table newtab create table XXXX (field1 int null) insert XXXX values(1) grant select on XXXX to sa drop table XXXX create table XXXX (field1 int null) insert XXXX values(1) grant select on XXXX to sa drop table XXXX rollback tran create table XXXX (field1 int null) sp_help XXXX insert XXXX values(1) grant select on XXXX to sa drop table XXXX rollback tran lock table cDsnJbgnd..smfJwDlwb in share mode

        I didn't write this earlier because I hadn't yet installed Regexp::Assemble or read its docs. Also, I removed the /g flag that you passed to Regexp::Assemble->new, as perl told me that it was a useless use.
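
        One thing that may be worth checking with patterns like these: as documented in perldoc -f split, capturing parentheses in the split pattern make split return the captured text as extra fields, and the pattern file here does use ([a-z]|[_])+. The sketch below is one possible way around that, not something taken from the Regexp::Assemble docs: it records where each match starts via @- and cuts the string at those offsets. The $re is again a hand-written stand-in and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hand-written stand-in that keeps the capturing groups from the pattern
# file; it is an assumption for this sketch, not Regexp::Assemble output.
my $re = qr/(?i:create table ([a-z]|[_])+|insert \b([a-z]|[_])+\b |lock table|grant)/;

my $text = 'create table XXXX (field1 int null) insert XXXX values(1) '
         . 'grant select on XXXX to sa';

# With capturing groups, split also returns the captures as extra fields.
my @with_captures = split /(?=$re)/, $text;

# Alternative: collect the start offset of every match ...
my @starts;
push @starts, $-[0] while $text =~ /$re/g;
# ... and make sure any text before the first match is kept as well.
unshift @starts, 0 unless @starts && $starts[0] == 0;

# Cut the string from each offset up to the next one (or the end).
my @parts;
for my $i (0 .. $#starts) {
    my $end = $i < $#starts ? $starts[$i + 1] : length $text;
    push @parts, substr $text, $starts[$i], $end - $starts[$i];
}

printf "split returned %d fields, offsets gave %d\n",
    scalar @with_captures, scalar @parts;
print "[$_]\n" for @parts;
```

        If the pattern file could be changed to use (?:...) groups instead, the plain split would not pick up the extra fields, but the question rules that out.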

        As I suspected, it's my lack of knowledge that's the problem. I figured it out.