In search of regex advice

DamnDirtyApe has asked for the wisdom of the Perl Monks concerning the following question:

Anyone who's up to giving regex advice, please, read on. I'm a little shaky on the subject, and would appreciate some good tips.

At work, I am currently working with a number of PL/SQL files. We have a test database and a QA database, and I often need to switch my scripts from referencing one to the other. As PL/SQL doesn't seem to have a pleasant way of doing this, I'm hoping I can surround the database names with identifiers within C-style comments so that the SQL procedures will compile properly, and at the same time Perl can identify the database names and replace them.

As an example, I'd like to be able to write

SELECT
    *
FROM
    MY_SCHEMA.MY_TABLE@/*<DATABASE>*/MY_DB/*</DATABASE>*/ MY,
    YOUR_SCHEMA.YOUR_TABLE@/*<DATABASE>*/YOUR_DB/*</DATABASE>*/ YOUR
WHERE
    MY.JOIN_KEY = YOUR.JOIN_KEY
[download]

in my PL/SQL scripts, and have Perl be able to identify the database names.

Sound simple? Well, it seems to be...

To test how easy it was to extract the database names in this way, I made a small test script.

#! /usr/bin/perl -w

use strict ;
use warnings ;
use diagnostics ;
use Data::Dumper ;
$|++ ;

#---------------------------------------

print "1ST TRY\n------\n" ;

my @db_list_one = () ;
my $text_one = q(
    SELECT
        *
    FROM
        SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/
) ;

my $matches_one = 0 ;

if ( $text_one =~
    m/\/\*\s*<DATABASE>\s*\*\/(.*)\/\*\s*<\/DATABASE>\s*\*\//i )
{
    push( @db_list_one, $1 ) ;
    $matches_one++ ;
}

print Dumper( \@db_list_one ), "\n\n" ;

#---------------------------------------

print "2ND TRY\n------\n" ;

my @db_list_two = () ;
my $text_two = q(
    SELECT
        *
    FROM
        SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/,
        SYNERGEN.SA_WORK_ORDER@/*<DATABASE>*/SGENTEST/*</DATABASE>*/
) ;

my $text_two_copy = $text_two ;
my $matches_two = 0 ;

while ( $text_two_copy =~
    s/\/\*\s*<DATABASE>\s*\*\/(.*)\/\*\s*<\/DATABASE>\s*\*\///i )
{
    push( @db_list_two, $1 ) ;
    $matches_two++ ;
}

print Dumper( \@db_list_two ), "\n\n" ;
[download]

This gives me the following output:

1ST TRY
------
$VAR1 = [
          'SGENQA'
        ];


2ND TRY
------
$VAR1 = [
          'SGENQA',
          'SGENTEST'
        ];
[download]

which was exactly what I had hoped for. After working this all out, I'm left with two big questions about my code.

In the second example, I make a copy of the string, then strip out the matches until they are gone. Is there a better way to do this?
I was quite interested in the Death to Dot Star! thread, but between my wetware and the Camel book I couldn't find a way to adapt merlyn's or Ovid's regexes to do the job without things getting horribly messy. Is there a compact way to do this?

Thanks for reading. :-)

_______________
D a m n D i r t y A p e
Home Node | Email

Comment on In search of regex advice Select or Download Code

Replies are listed 'Best First'.
Re: In search of regex advice by dvergin (Monsignor) on Oct 28, 2001 at 05:10 UTC
Holy tilting toothpicks, Batman! ...first some code: use strict; use Data::Dumper ; print "1ST TRY\n------\n" ; my @db_list_one = () ; my $text_one = q( SELECT * FROM SYNERGEN.SA_ASSET@/<DATABASE>/SGENQA/</DATABASE>/ ); my $matches_one = 0 ; my $start = '/\<DATABASE>\/'; my $end = '/\</DATABASE>\/'; if ( $text_one =~ m/$start(.?)$end/i ) { push @db_list_one, $1; $matches_one++ ; } print Dumper( \@db_list_one ), "\n\n" ; #--------------------------------------- print "2ND TRY\n------\n" ; my @db_list_two = () ; my $text_two = q( SELECT FROM SYNERGEN.SA_ASSET@/<DATABASE>/SGENQA/</DATABASE>/, SYNERGEN.SA_WORK_ORDER@/<DATABASE>/SGENTEST/</DATABASE>/ + ); my $matches_two = 0 ; while ( $text_two =~ /$start(.?)$end/gi ) { push @db_list_two, $1; $matches_two++ ; } print Dumper( \@db_list_two ), "\n\n" ; [download] Now a few comments: I place a premium on the human-readable quotient of code. Thus the separating out of the $start and $end vars. That goes a long way toward making the regexen easier to make sense of. The non-greedy .? (with the question mark) prevents spanning multiple $start...$end pairs if they occur on one line. It is a bit safer than the dreaded dot-star. The question you pose about how to capture several strings satisfying the same regex is answered with the use of the /g modifier to the second regex. The context of the while loop conditional puts the regex in a scalar context -- which puts it in "progressive match" mode: it walks through the string returning a true value for each match. This allows you to capture each captured value one-at-a-time inside the while loop. Update: In a list context, the /g regex returns a list of all the values found. So you can replace your my declaration and the while loop with: `my @db_list3 = $text_two =~ /$start(.?)$end/gi; my $matches3 = @db_list3;` [download] And if you want to make yourself really* nuts the next time you come back to this code and try to make sense of it, you could replace the entire mess with: `my $text = 'whatever...'; my ($start, $end) = qw( /\<DATABASE>\/ /\</DATABASE>\/ ); my $matches = my @db_list = $text =~ /$start(.*?)$end/gi;` [download] But that would be tempting the fates wouldn't it. `;-)`	[reply] [d/l] [select]
Re: Re: In search of regex advice by DamnDirtyApe (Curate) on Oct 29, 2001 at 11:20 UTC
Thanks a lot for your advice. I played with it, built on it a bit, and came up with this: #! /usr/bin/perl -w use strict ; use warnings ; use diagnostics ; use Data::Dumper ; $\|++ ; my $sql_query = q( /<QUERY>/ SELECT /<FIELD>/ ASSET_ID /</FIELD>/ , /<FIELD>/ ASSET_DESC /</FIELD>/ , /<FIELD>/ ASSET_COST /</FIELD>/ FROM /<FULL_TBL>/ SYNERGEN.SA_ASSET@/<DATABASE>/SGENQA/</DATABASE>/ /</FULL_TBL>/, /<FULL_TBL>/ SYNERGEN.SA_WORK_ORDER@/<DATABASE>/SGENTEST/</DATABASE>/ /</FULL_TBL>/ WHERE /<CONDITION>/ ASSET_ID IS NOT NULL /</CONDITION>/ AND /<CONDITION>/ ASSET_DESC IS NOT NULL /</CONDITION>/ AND /<CONDITION>/ ASSET_COST > 100 /</CONDITION>/ /</QUERY>/ ) ; my @type_list = qw( FIELD DATABASE CONDITION FULL_TBL QUERY ) ; foreach my $type ( @type_list ) { my @list = &get_match_list( $type, $sql_query ) ; print Dumper( \@list ), "\n" ; } exit( 0 ) ; #----- F U N C T I O N S ---------------------------------------- sub start_tag { my $tag = shift ; my $start = "\\/\\\\s<$tag>\\s\\\\/" ; return $start ; } sub end_tag { my $tag = shift ; my $end = "\\/\\\\s<\\/$tag>\\s\\\\/" ; return $end ; } sub get_match_list { my ( $tag, $text ) = @_ ; my @match_list = () ; # create the start & end tags my ( $start, $end ) = ( &start_tag( $tag ), &end_tag( $tag ) ) ; # as long as you're finding tag pairs... while ( $text =~ m/$start\s(.?)\s$end/gi ) { my $new_match = $1 ; # strip out any comments, and replace spaces on either end # with a single space (leaving line breaks alone) my $spc = $new_match =~ /[ \t]+\/\.?\\/\|\/\.?\\/[ \t]+/ ? ' ' : '' ; $new_match =~ s/[ \t]\/\(.?)\\/[ \t]/$spc/g ; # Strip any whitespace off the ends $new_match =~ s/^\s(.?)\s*$/$1/ ; push( @match_list, $new_match ) ; } return @match_list ; } [download] The output looks like `$VAR1 = [ 'ASSET_ID', 'ASSET_DESC', 'ASSET_COST' ]; $VAR1 = [ 'SGENQA', 'SGENTEST' ]; $VAR1 = [ 'ASSET_ID IS NOT NULL', 'ASSET_DESC IS NOT NULL', 'ASSET_COST > 100' ]; $VAR1 = [ 'SYNERGEN.SA_ASSET@SGENQA', 'SYNERGEN.SA_WORK_ORDER@SGENTEST' ]; $VAR1 = [];` [download] ...very nearly what i was aiming for, but I can't seem to get the regex to match across a bunch of lines. What do I need to do to make the QUERY tags match as intended? _______________ D a m n D i r t y A p e Home Node \| Email	[reply] [d/l] [select]
Re: Re: Re: In search of regex advice by blakem (Monsignor) on Oct 29, 2001 at 11:41 UTC
can't seem to get the regex to match across a bunch of lines I haven't fully examined your code, but the phrase above makes me think you should check out the /s regex modifier, which makes the wildcard '.' match a newline. Since you have '.*' in several of your regexes, its probably worth seeing if the /s modifier will make them behave the way you want them to. -Blake	[reply]
Re: Re: Re: In search of regex advice by dvergin (Monsignor) on Oct 29, 2001 at 11:38 UTC
Use the /s regex modifier. It causes '.' to match newline (which it otherwise does not match): `my $str = "abc\ndef"; $str =~ /(b.e)/; print "1[$1]\n"; # 1[] Use of uninitialized value! $str =~ /(b.e)/s; print "2[$1]\n"; # 2[bc\nde]` [download] Cheers! David	[reply] [d/l]