DamnDirtyApe has asked for the wisdom of the Perl Monks concerning the following question:

Anyone who's up to giving regex advice, please, read on. I'm a little shaky on the subject, and would appreciate some good tips.

At work, I am currently working with a number of PL/SQL files. We have a test database and a QA database, and I often need to switch my scripts from referencing one to the other. As PL/SQL doesn't seem to have a pleasant way of doing this, I'm hoping I can surround the database names with identifiers within C-style comments so that the SQL procedures will compile properly, and at the same time Perl can identify the database names and replace them.

As an example, I'd like to be able to write

SELECT * FROM MY_SCHEMA.MY_TABLE@/*<DATABASE>*/MY_DB/*</DATABASE>*/ MY, YOUR_SCHEMA.YOUR_TABLE@/*<DATABASE>*/YOUR_DB/*</DATABASE>*/ YOUR WHERE MY.JOIN_KEY = YOUR.JOIN_KEY

in my PL/SQL scripts, and have Perl be able to identify the database names.

Sound simple? Well, it seems to be...

To test how easy it was to extract the database names in this way, I made a small test script.

#! /usr/bin/perl -w use strict ; use warnings ; use diagnostics ; use Data::Dumper ; $|++ ; #--------------------------------------- print "1ST TRY\n------\n" ; my @db_list_one = () ; my $text_one = q( SELECT * FROM SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/ ) ; my $matches_one = 0 ; if ( $text_one =~ m/\/\*\s*<DATABASE>\s*\*\/(.*)\/\*\s*<\/DATABASE>\s*\*\//i ) { push( @db_list_one, $1 ) ; $matches_one++ ; } print Dumper( \@db_list_one ), "\n\n" ; #--------------------------------------- print "2ND TRY\n------\n" ; my @db_list_two = () ; my $text_two = q( SELECT * FROM SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/, SYNERGEN.SA_WORK_ORDER@/*<DATABASE>*/SGENTEST/*</DATABASE>*/ ) ; my $text_two_copy = $text_two ; my $matches_two = 0 ; while ( $text_two_copy =~ s/\/\*\s*<DATABASE>\s*\*\/(.*)\/\*\s*<\/DATABASE>\s*\*\///i ) { push( @db_list_two, $1 ) ; $matches_two++ ; } print Dumper( \@db_list_two ), "\n\n" ;

This gives me the following output:

1ST TRY ------ $VAR1 = [ 'SGENQA' ]; 2ND TRY ------ $VAR1 = [ 'SGENQA', 'SGENTEST' ];

which was exactly what I had hoped for. After working this all out, I'm left with two big questions about my code.

Thanks for reading. :-)


_______________
D a m n D i r t y A p e
Home Node | Email

Replies are listed 'Best First'.
Re: In search of regex advice
by dvergin (Monsignor) on Oct 28, 2001 at 05:10 UTC
    Holy tilting toothpicks, Batman! ...first some code:
    use strict; use Data::Dumper ; print "1ST TRY\n------\n" ; my @db_list_one = () ; my $text_one = q( SELECT * FROM SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/ ); my $matches_one = 0 ; my $start = '/\*<DATABASE>\*/'; my $end = '/\*</DATABASE>\*/'; if ( $text_one =~ m/$start(.*?)$end/i ) { push @db_list_one, $1; $matches_one++ ; } print Dumper( \@db_list_one ), "\n\n" ; #--------------------------------------- print "2ND TRY\n------\n" ; my @db_list_two = () ; my $text_two = q( SELECT * FROM SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/, SYNERGEN.SA_WORK_ORDER@/*<DATABASE>*/SGENTEST/*</DATABASE>*/ + ); my $matches_two = 0 ; while ( $text_two =~ /$start(.*?)$end/gi ) { push @db_list_two, $1; $matches_two++ ; } print Dumper( \@db_list_two ), "\n\n" ;
    Now a few comments:

    I place a premium on the human-readable quotient of code. Thus the separating out of the $start and $end vars. That goes a long way toward making the regexen easier to make sense of.

    The non-greedy .*? (with the question mark) prevents spanning multiple $start...$end pairs if they occur on one line. It is a bit safer than the dreaded dot-star.

    The question you pose about how to capture several strings satisfying the same regex is answered with the use of the /g modifier to the second regex. The context of the while loop conditional puts the regex in a scalar context -- which puts it in "progressive match" mode: it walks through the string returning a true value for each match. This allows you to capture each captured value one-at-a-time inside the while loop.

    Update: In a list context, the /g regex returns a list of all the values found. So you can replace your my declaration and the while loop with:

    my @db_list3 = $text_two =~ /$start(.*?)$end/gi; my $matches3 = @db_list3;
    And if you want to make yourself really nuts the next time you come back to this code and try to make sense of it, you could replace the entire mess with:
    my $text = 'whatever...'; my ($start, $end) = qw( /\*<DATABASE>\*/ /\*</DATABASE>\*/ ); my $matches = my @db_list = $text =~ /$start(.*?)$end/gi;
    But that would be tempting the fates wouldn't it.   ;-)

      Thanks a lot for your advice. I played with it, built on it a bit, and came up with this:

      #! /usr/bin/perl -w use strict ; use warnings ; use diagnostics ; use Data::Dumper ; $|++ ; my $sql_query = q( /*<QUERY>*/ SELECT /*<FIELD>*/ ASSET_ID /*</FIELD>*/ , /*<FIELD>*/ ASSET_DESC /*</FIELD>*/ , /*<FIELD>*/ ASSET_COST /*</FIELD>*/ FROM /*<FULL_TBL>*/ SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/ /*</FULL_TBL>*/, /*<FULL_TBL>*/ SYNERGEN.SA_WORK_ORDER@/*<DATABASE>*/SGENTEST/*</DATABASE>*/ /*</FULL_TBL>*/ WHERE /*<CONDITION>*/ ASSET_ID IS NOT NULL /*</CONDITION>*/ AND /*<CONDITION>*/ ASSET_DESC IS NOT NULL /*</CONDITION>*/ AND /*<CONDITION>*/ ASSET_COST > 100 /*</CONDITION>*/ /*</QUERY>*/ ) ; my @type_list = qw( FIELD DATABASE CONDITION FULL_TBL QUERY ) ; foreach my $type ( @type_list ) { my @list = &get_match_list( $type, $sql_query ) ; print Dumper( \@list ), "\n" ; } exit( 0 ) ; #----- F U N C T I O N S ---------------------------------------- sub start_tag { my $tag = shift ; my $start = "\\/\\*\\s*<$tag>\\s*\\*\\/" ; return $start ; } sub end_tag { my $tag = shift ; my $end = "\\/\\*\\s*<\\/$tag>\\s*\\*\\/" ; return $end ; } sub get_match_list { my ( $tag, $text ) = @_ ; my @match_list = () ; # create the start & end tags my ( $start, $end ) = ( &start_tag( $tag ), &end_tag( $tag ) ) ; # as long as you're finding tag pairs... while ( $text =~ m/$start\s*(.*?)\s*$end/gi ) { my $new_match = $1 ; # strip out any comments, and replace spaces on either end # with a single space (leaving line breaks alone) my $spc = $new_match =~ /[ \t]+\/\*.*?\*\/|\/\*.*?\*\/[ \t]+/ ? ' ' : '' ; $new_match =~ s/[ \t]*\/\*(.*?)\*\/[ \t]*/$spc/g ; # Strip any whitespace off the ends $new_match =~ s/^\s*(.*?)\s*$/$1/ ; push( @match_list, $new_match ) ; } return @match_list ; }

      The output looks like

      $VAR1 = [ 'ASSET_ID', 'ASSET_DESC', 'ASSET_COST' ]; $VAR1 = [ 'SGENQA', 'SGENTEST' ]; $VAR1 = [ 'ASSET_ID IS NOT NULL', 'ASSET_DESC IS NOT NULL', 'ASSET_COST > 100' ]; $VAR1 = [ 'SYNERGEN.SA_ASSET@SGENQA', 'SYNERGEN.SA_WORK_ORDER@SGENTEST' ]; $VAR1 = [];

      ...very nearly what i was aiming for, but I can't seem to get the regex to match across a bunch of lines. What do I need to do to make the QUERY tags match as intended?


      _______________
      D a m n D i r t y A p e
      Home Node | Email
        can't seem to get the regex to match across a bunch of lines
        I haven't fully examined your code, but the phrase above makes me think you should check out the /s regex modifier, which makes the wildcard '.' match a newline. Since you have '.*' in several of your regexes, its probably worth seeing if the /s modifier will make them behave the way you want them to.

        -Blake

        Use the /s regex modifier. It causes '.' to match newline (which it otherwise does not match):
        my $str = "abc\ndef"; $str =~ /(b.*e)/; print "1[$1]\n"; # 1[] Use of uninitialized value! $str =~ /(b.*e)/s; print "2[$1]\n"; # 2[bc\nde]
        Cheers!
        David