in reply to In search of regex advice

Holy tilting toothpicks, Batman! ...first some code:
use strict;
use Data::Dumper ;

print "1ST TRY\n------\n" ;
my @db_list_one = () ;
my $text_one = q(
    SELECT * FROM SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/
);
my $matches_one = 0 ;
my $start = '/\*<DATABASE>\*/';
my $end   = '/\*</DATABASE>\*/';
if ( $text_one =~ m/$start(.*?)$end/i ) {
    push @db_list_one, $1;
    $matches_one++ ;
}
print Dumper( \@db_list_one ), "\n\n" ;

#---------------------------------------

print "2ND TRY\n------\n" ;
my @db_list_two = () ;
my $text_two = q(
    SELECT * FROM SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/, SYNERGEN.SA_WORK_ORDER@/*<DATABASE>*/SGENTEST/*</DATABASE>*/
);
my $matches_two = 0 ;
while ( $text_two =~ /$start(.*?)$end/gi ) {
    push @db_list_two, $1;
    $matches_two++ ;
}
print Dumper( \@db_list_two ), "\n\n" ;

Now a few comments:

I place a premium on the human-readable quotient of code. Thus the separating out of the $start and $end vars. That goes a long way toward making the regexen easier to make sense of.

The non-greedy .*? (with the question mark) prevents spanning multiple $start...$end pairs if they occur on one line. It is a bit safer than the dreaded dot-star.
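
To make the difference concrete, here is a minimal sketch that runs both quantifiers against one line holding two tag pairs. The sample line and the names $greedy and $lazy are made up for illustration; $start and $end are the same patterns used above.

```perl
use strict;
use warnings;

# Made-up sample: two tag pairs on a single line.
my $line  = 'A@/*<DATABASE>*/ONE/*</DATABASE>*/, B@/*<DATABASE>*/TWO/*</DATABASE>*/';
my $start = '/\*<DATABASE>\*/';
my $end   = '/\*</DATABASE>\*/';

# Greedy: .* grabs as much as it can, so it runs to the LAST end tag.
my ($greedy) = $line =~ m/$start(.*)$end/i;

# Non-greedy: .*? grabs as little as it can, so it stops at the FIRST end tag.
my ($lazy) = $line =~ m/$start(.*?)$end/i;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

The greedy capture swallows everything between the first start tag and the last end tag, while the non-greedy one returns just ONE.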

The question you pose about how to capture several strings satisfying the same regex is answered by adding the /g modifier to the second regex. The while loop conditional evaluates the regex in scalar context, which puts it in "progressive match" mode: each evaluation resumes where the previous match ended and returns true as long as another match is found. That lets you pick up the captured values one at a time inside the while loop.
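
If you want to watch the progressive match at work, pos() reports where the next attempt will resume. This is only a sketch: the sample text and the @found array are assumptions for illustration, not part of the code above.

```perl
use strict;
use warnings;

# Made-up sample text with two tag pairs.
my $text  = 'x@/*<DATABASE>*/SGENQA/*</DATABASE>*/, y@/*<DATABASE>*/SGENTEST/*</DATABASE>*/';
my $start = '/\*<DATABASE>\*/';
my $end   = '/\*</DATABASE>\*/';

my @found;
while ( $text =~ /$start(.*?)$end/gi ) {
    # pos($text) is the offset at which the next /g attempt will start.
    printf "captured %-8s next search starts at offset %d\n", $1, pos($text);
    push @found, $1;
}

# The final, failed attempt ends the loop and resets pos($text) to undef.
print "pos is ", ( defined pos($text) ? pos($text) : 'undef' ), " after the loop\n";
```

Because the position is reset when the match finally fails, running the same loop on $text again starts over from the beginning of the string.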

Update: In a list context, the /g regex returns a list of all the values found. So you can replace your my declaration and the while loop with:

my @db_list3 = $text_two =~ /$start(.*?)$end/gi;
my $matches3 = @db_list3;
And if you want to make yourself really nuts the next time you come back to this code and try to make sense of it, you could replace the entire mess with:
my $text = 'whatever...';
my ($start, $end) = qw( /\*<DATABASE>\*/ /\*</DATABASE>\*/ );
my $matches = my @db_list = $text =~ /$start(.*?)$end/gi;
But that would be tempting the fates, wouldn't it?   ;-)
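
For anyone puzzling over why the chained assignment above yields a count: a list assignment evaluated in scalar context returns the number of elements on its right-hand side. A small sketch with made-up sample data and hypothetical variable names:

```perl
use strict;
use warnings;

my $text = 'a=1, b=2, c=3';    # made-up sample

# Spelled out: list context collects the captures,
# then the array in scalar context gives its length.
my @values = $text =~ /=(\d)/g;
my $count  = @values;

# Chained: the inner list assignment fills @chained, and its
# scalar-context value is the number of elements assigned from.
my $chained_count = my @chained = $text =~ /=(\d)/g;

print "$count $chained_count\n";
```

Both forms end up with the same array and the same count; the spelled-out version is just kinder to whoever reads it next.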

Re: Re: In search of regex advice
by DamnDirtyApe (Curate) on Oct 29, 2001 at 11:20 UTC

    Thanks a lot for your advice. I played with it, built on it a bit, and came up with this:

    #! /usr/bin/perl -w

    use strict ;
    use warnings ;
    use diagnostics ;
    use Data::Dumper ;

    $|++ ;

    my $sql_query = q(
        /*<QUERY>*/
        SELECT
            /*<FIELD>*/ ASSET_ID /*</FIELD>*/ ,
            /*<FIELD>*/ ASSET_DESC /*</FIELD>*/ ,
            /*<FIELD>*/ ASSET_COST /*</FIELD>*/
        FROM
            /*<FULL_TBL>*/ SYNERGEN.SA_ASSET@/*<DATABASE>*/SGENQA/*</DATABASE>*/ /*</FULL_TBL>*/,
            /*<FULL_TBL>*/ SYNERGEN.SA_WORK_ORDER@/*<DATABASE>*/SGENTEST/*</DATABASE>*/ /*</FULL_TBL>*/
        WHERE
            /*<CONDITION>*/ ASSET_ID IS NOT NULL /*</CONDITION>*/
            AND /*<CONDITION>*/ ASSET_DESC IS NOT NULL /*</CONDITION>*/
            AND /*<CONDITION>*/ ASSET_COST > 100 /*</CONDITION>*/
        /*</QUERY>*/
    ) ;

    my @type_list = qw( FIELD DATABASE CONDITION FULL_TBL QUERY ) ;

    foreach my $type ( @type_list ) {
        my @list = &get_match_list( $type, $sql_query ) ;
        print Dumper( \@list ), "\n" ;
    }

    exit( 0 ) ;

    #----- F U N C T I O N S ----------------------------------------

    sub start_tag {
        my $tag = shift ;
        my $start = "\\/\\*\\s*<$tag>\\s*\\*\\/" ;
        return $start ;
    }

    sub end_tag {
        my $tag = shift ;
        my $end = "\\/\\*\\s*<\\/$tag>\\s*\\*\\/" ;
        return $end ;
    }

    sub get_match_list {
        my ( $tag, $text ) = @_ ;
        my @match_list = () ;

        # create the start & end tags
        my ( $start, $end ) = ( &start_tag( $tag ), &end_tag( $tag ) ) ;

        # as long as you're finding tag pairs...
        while ( $text =~ m/$start\s*(.*?)\s*$end/gi ) {
            my $new_match = $1 ;

            # strip out any comments, and replace spaces on either end
            # with a single space (leaving line breaks alone)
            my $spc = $new_match =~ /[ \t]+\/\*.*?\*\/|\/\*.*?\*\/[ \t]+/ ? ' ' : '' ;
            $new_match =~ s/[ \t]*\/\*(.*?)\*\/[ \t]*/$spc/g ;

            # Strip any whitespace off the ends
            $new_match =~ s/^\s*(.*?)\s*$/$1/ ;

            push( @match_list, $new_match ) ;
        }

        return @match_list ;
    }

    The output looks like

    $VAR1 = [
              'ASSET_ID',
              'ASSET_DESC',
              'ASSET_COST'
            ];

    $VAR1 = [
              'SGENQA',
              'SGENTEST'
            ];

    $VAR1 = [
              'ASSET_ID IS NOT NULL',
              'ASSET_DESC IS NOT NULL',
              'ASSET_COST > 100'
            ];

    $VAR1 = [
              'SYNERGEN.SA_ASSET@SGENQA',
              'SYNERGEN.SA_WORK_ORDER@SGENTEST'
            ];

    $VAR1 = [];

    ...very nearly what I was aiming for, but I can't seem to get the regex to match across a bunch of lines. What do I need to do to make the QUERY tags match as intended?


    _______________
    D a m n D i r t y A p e
      "can't seem to get the regex to match across a bunch of lines"

      I haven't fully examined your code, but the phrase above makes me think you should check out the /s regex modifier, which makes the wildcard '.' match a newline. Since you have '.*' in several of your regexes, it's probably worth seeing whether the /s modifier makes them behave the way you want.

      -Blake

      Use the /s regex modifier. It causes '.' to match newline (which it otherwise does not match):
      my $str = "abc\ndef";

      $str =~ /(b.*e)/;
      print "1[$1]\n";    # 1[]  Use of uninitialized value!

      $str =~ /(b.*e)/s;
      print "2[$1]\n";    # 2[bc\nde]
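
Applied to the tag-matching loop from the question, the change is a single letter on the regex. Here is a minimal sketch, assuming a cut-down query and hard-coded QUERY patterns; $sql, @without_s and @with_s are hypothetical names used only for this comparison.

```perl
use strict;
use warnings;

# Hypothetical cut-down query: the QUERY tags sit on different lines.
my $sql = q(
/*<QUERY>*/
SELECT ASSET_ID
FROM SYNERGEN.SA_ASSET
/*</QUERY>*/
);

my $start = '/\*\s*<QUERY>\s*\*/';
my $end   = '/\*\s*</QUERY>\s*\*/';

# Without /s, '.' stops at the first newline, so the pair never matches.
my @without_s = $sql =~ m/$start\s*(.*?)\s*$end/gi;

# With /s, '.' also matches newlines, so the whole block is captured.
my @with_s = $sql =~ m/$start\s*(.*?)\s*$end/gis;

print scalar(@without_s), " match(es) without /s\n";
print scalar(@with_s),    " match(es) with /s\n";
print "[$with_s[0]]\n";
```

The non-greedy .*? still matters here: with /s in effect, a greedy .* would happily run across every line up to the last closing tag in the text.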
      Cheers!
      David