BillB has asked for the wisdom of the Perl Monks concerning the following question:

I am fairly new to using regular expressions in perl and I am trying to do the following task.

I am using the following match to search a sql script

if $ScanString =~ m/insert\s.+doc.+[\(|values|select|;]?/is;

It is returning true on the following code fragment. What I want is that if the table name of DOC follows the INSERT, but if it gets to any of "(", SELECT or ";" it will return false. In other words if the DOC is not there before I hit "(", SELECT or ";" it will always return false. The perl match from above is returning a false positive because the string "DOC_NUM" is in the attached select. However the full script can have many inserts. If any of them match the pattern I want it to be true. If none of them match I want the test to be false. Any help would be appreciated. The script fragment follows

INSERT INTO gl_trn ( CO_CD, GL_ACCT_CD, JRNL_CD, EMP_CD_OP, POST_DT, DES, REF, DC_CD, AMT, TRN_DT, GL_TRN_ORIGIN_CD, FINAL_DT, CO_UNIT, TRN_TIME, CMNT, BNK_ACCT_CD, BNK_TRN_TP_CD, DOC_NUM, POST_ID_NUM, BNK_CO_CD) SELECT 'XXX', GL_ACCT_CD, NULL, EMP_CD_OP, to_date('&to_post_dt', 'MMD +DRR'), 'REVERSAL'||' '||substr(des, 1, 30), 'R'||substr(REF, 1, 18), decode(dc_cd, 'C', 'D', 'D', 'C'), amt, TRUNC(SYSDATE), gl_trn_origin_cd, NULL, co_unit, floor((sysdate-trunc(sysdate))*86400), 'Auto-generated by DEV using ACREV', BNK_ACCT_CD , BNK_TRN_TP_CD, DOC_NUM, POST_ID_NUM , BNK_CO_CD FROM gl_trn WHERE post_dt = to_date('&from_post_dt', 'MMDDRR') AND jrnl_cd = 'A' AND co_cd = 'RAY' AND NOT EXISTS (SELECT 'X' FROM gl_reversed g WHERE g.GL_TRN_SEQ_NUM = gl_trn.gl_trn_seq_num);

Replies are listed 'Best First'.
Re: Can I have a hard stop on a match
by haukex (Archbishop) on Nov 17, 2017 at 14:16 UTC

    The general problem with the regex is that . matches any character (because you've used /s), and that .+ is by default greedy, that is, it will match as many times as possible. Have a look at perlretut (pay attention to mentions of "greed").

    One way to solve this is to change the dot . to an expression that says "match anything except this". For single characters, one could use a negated character class, as in [^a-z]* which will match any characters except lowercase ASCII letters. For longer strings, one way to do it is with a negative lookahead: (?:(?!not|these|strings).)* will give you the "hard stop" you want.

    You've only provided one example string, but when developing regular expressions, the more test cases the better. I showed one way to write tests for regexes here: Re: How to ask better questions using Test::More and sample data. For now I've guessed a few test cases, but you should fill that out. This also works on your long string:

    use warnings; use strict; use Test::More; my $regex = qr{ \b INSERT \b (?: # the next thing may not be one of these: (?! \b SELECT \b | \( | \; ) . )+ # match one or more times \b doc \b # target string }imsx; like "INSERT doc (", $regex; like "INSERT INTO doc bar (", $regex; unlike "INSERT foo ( doc", $regex; unlike "INSERT foo SELECT doc", $regex; unlike "INSERT foo ; doc", $regex; done_testing;

    Update before posting: Eily already posted something very similar here, I'll post this anyway. Update: Adding \b is a good point. Update 2: Minor edits for clarity.

Re: Can I have a hard stop on a match
by Eily (Monsignor) on Nov 17, 2017 at 14:04 UTC

    Edit: actually I think holli's answer is better. It's just easier to use a tool designed specifically for SQL rather than try to parse the request and try to think of all the edge cases.

    You should switch to Ruby and use the (?~ ) construct which makes solving your problem easier. While it is possible to solve your problem in a single regex, using more advanced features, there's also the simpler solution of dividing the problem in more simple steps. For example you could first extract the string between INSERT and either ( or SELECT, and then search for DOC in that substring.

    # Untested my $ScanString =~ / ( # start of capture + INSERT .*? # the ? after * means + the shortest possible match ) # end of capture (?:\(|values|select|;) # (?: thing ) group +s without capturing /six; # s allows .* to match across several lines +. x allows spaces and comments in the regex my $substring = $1; my $found_doc = $substring =~ /\b doc \b/ix; # \b means boundary, so b +eginning or end of a word
    By the way, [\(|values|select] means "either (, or | or v, or a, or l, or u, or e, or s, or | or s....", grouping is done with parentheses.

Re: Can I have a hard stop on a match
by roboticus (Chancellor) on Nov 17, 2017 at 14:01 UTC

    BillB:

    I'd suggest placing "\s" just before and after doc in your regex, so it will accept doc as a word instead of as a word fragment.

    Since you only want to match it if it precedes '(', 'SELECT' or ';', another way to do it would be to split your string on any of those tokens before doing your check, so any subselect or statement following your insert would be excluded from your match.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Can I have a hard stop on a match
by holli (Abbot) on Nov 17, 2017 at 14:04 UTC
    I wouldn't use a Regex for this because of the problems you describe, false positives et al. SQL::Parser looks like a promising candidate for doing this reliably.


    holli

    You can lead your users to water, but alas, you cannot drown them.
      SQL::Parser

      This was my first thought as well, to use a parser instead of a regex whenever possible. Unfortunately, it turned out that it doesn't seem to work. I checked the SQL::Statement::Syntax docs and they say that only INSERT statements of the form INSERT INTO $table [ ( $col1, ..., $colN ) ] VALUES ( $val1, ... $valN ) are supported, and "inserting from a subquery is not currently supported". Posting the code here anyway, maybe I missed something.

Re: Can I have a hard stop on a match
by AnomalousMonk (Archbishop) on Nov 17, 2017 at 16:51 UTC

    If you have Perl version 5.10+ regex extensions, you might also consider the  (*COMMIT) backtracking control verb. Used in conjunction with (*FAIL), this can produce a "hard" stop if a pattern is found that's as hard as you could ever want.


    Give a man a fish:  <%-{-{-{-<

Re: Can I have a hard stop on a match
by 1nickt (Canon) on Nov 17, 2017 at 14:02 UTC

    Hi,

    doc.+
    This matches 'doc' followed by one or more of any character, so naturally 'DOC_NUM' matches (with /i). If you are looking for a match of your string followed by a comma or a space, make it so:
    doc[,\s]

    Hope this helps!


    The way forward always starts with a minimal test.