DACONTI has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks,
I would like to ask for some of the Monks' wisdom.
I have the following problem:
I'm trying to set up an automatic testing system for DB programs (Informatica ETLs and PL/SQL programs) based on my test data generation tool (http://dbfeeder.sourceforge.net/).
Now one of the major problems I encounter is getting data which fits the programs' database queries.
I try to achieve this goal by parsing the programs' code, and in particular the SQL queries, to detect literal values (strings, numbers, dates) and join conditions between tables.
I decided to adopt the following strategy:
- use Parse::RecDescent
- limit the parsing to join conditions and literal expressions (i.e. not try to parse the whole SQL)
- collect all DB information (table structures, column names and types) in advance
- accept failures (a few undetected literals or join conditions in a long query would be acceptable, since they can be fixed manually afterwards in DBFeeder).
I thought at the beginning that this way I could be faster than implementing a formal SQL parser in Parse::RecDescent and then looking for the literals.
Now I'm not sure if this is the right strategy.
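To make the idea concrete, here is roughly the kind of stripped-down grammar I have in mind (just an illustrative sketch, not my real code): it ignores almost all SQL structure and only captures quoted strings, numbers, and column = column pairs.

    use strict;
    use warnings;
    use Parse::RecDescent;

    # Results are collected in a package variable so the grammar
    # actions (which run in the parser's namespace) can reach it.
    our %found = ( literals => [], joins => [] );

    my $grammar = q{
        query     : chunk(s)
        chunk     : join_cond | literal | other
        join_cond : column '=' column
                    { push @{$main::found{joins}}, "$item[1] = $item[3]" }
        literal   : string | number
        string    : /'(?:[^']|'')*'/
                    { push @{$main::found{literals}}, $item[1] }
        number    : /\b\d+(?:\.\d+)?\b/
                    { push @{$main::found{literals}}, $item[1] }
        column    : /\w+\.\w+/
        other     : /\S+/
    };

    my $parser = Parse::RecDescent->new($grammar) or die "bad grammar\n";

    my $sql = q{SELECT a.x FROM a, b WHERE a.id = b.id AND a.name = 'foo' AND a.n > 42};
    $parser->query($sql);

    print "literal: $_\n" for @{ $found{literals} };   # 'foo', 42
    print "join:    $_\n" for @{ $found{joins} };      # a.id = b.id

The real grammar would of course need to handle dates, >=/<= comparisons, quoted identifiers and so on, but that is the general shape.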
How would you proceed to get literals and join conditions from the SQL queries?

Thanks in advance for any suggestion.
Best Regards,
Davide.

Re: SQL Parsing with Parse::RecDescent
by Fletch (Bishop) on Mar 05, 2008 at 13:46 UTC

    First thing I'd do would be to see if SQL::Statement solves the problem off the shelf rather than wheel-reinventing myself.
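
    For what it's worth, the basic usage is only a few lines (an untested sketch; the query below is just an illustration, and Oracle-specific constructs like to_date() are exactly where it may need more work):

        use strict;
        use warnings;
        use SQL::Parser;
        use SQL::Statement;

        my $parser = SQL::Parser->new( 'ANSI', { RaiseError => 1 } );
        my $sql    = q{SELECT a.col1, b.col2 FROM a, b WHERE a.id = b.id AND a.col1 = 'foo'};
        my $stmt   = SQL::Statement->new( $sql, $parser );

        printf "Tables:  %s\n", join ', ', map { $_->name } $stmt->tables;
        printf "Columns: %s\n", join ', ', map { $_->name } $stmt->columns;

        # The WHERE clause comes back as an op tree that can be walked
        # looking for '=' nodes (join conditions) and literal operands.
        printf "WHERE op: %s\n", $stmt->where->op if $stmt->where;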

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

      Hi Fletch,
      I was quite optimistic about your idea.
      I read the doc and did some tests with SQL::Statement.
      Unfortunately it doesn't seem to be straightforward.
      Using this query:
      select some_table.a_column, some_table.a_2ndcolumn
      FROM some_table
      UNION
      select another_table.a_3rdcolumn, another_table.a_4thcolumn
      FROM another_table
      WHERE some_table.a_date_column <= to_date('20071111', 'YYYYMMDD')
      AND some_table.a2nd_date_column >= to_date('20071111', 'YYYYMMDD')
      I couldn't get all the columns identified, and it doesn't seem to recognize the UNION (by the way, I'm using Oracle SQL syntax).
      So I have the impression that one would need to do extensive customization work to really make it do the job.
      In that case I would rather stay with Parse::RecDescent.
      Do you know of any SQL query parsing examples with Parse::RecDescent?
      Anyway, many thanks,
      Davide.
Re: SQL Parsing with Parse::RecDescent
by dragonchild (Archbishop) on Mar 05, 2008 at 15:29 UTC
    If SQL::Statement doesn't do it for you, try DBIx::MyParsePP. It's specifically for MySQL, but MySQL does implement SQL-92 and SQL-99.

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: SQL Parsing with Parse::RecDescent
by moritz (Cardinal) on Mar 05, 2008 at 13:36 UTC
    How would you proceed to get literals and join conditions from the SQL queries?

    By running the program, and letting your database (or DBI) log the queries for you.

    With your approach you won't find SQL that has interpolated variables, or that is dynamically created (think DBIx::Class).
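
    For example, DBI's built-in trace facility writes every method call, including the SQL text handed to prepare() and do(), to a log file. A minimal sketch (the DSN and credentials are placeholders):

        use strict;
        use warnings;
        use DBI;

        # Log all DBI calls, including prepared/executed SQL, to a file.
        DBI->trace( 1, 'dbi_trace.log' );

        # Placeholder connection details, just for the sketch.
        my $dbh = DBI->connect( 'dbi:Oracle:mydb', 'user', 'pass',
                                { RaiseError => 1 } );

        my $sth = $dbh->prepare('SELECT 1 FROM dual');   # shows up in dbi_trace.log
        $sth->execute;

    Grepping that trace file for the prepare/do entries then gives you the statements the program actually ran.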

      >>By running the program, and letting your database (or DBI) log the queries for you
      Do you mean taking the logged queries from a run of the tested ETL program itself?
      No, I can't use this, because we test these programs at the first development stage, where we have no real productive data matching their queries, so when we run them they return
      ... no rows found ...
      But thanks for the advice, I'll have a look at DBIx::Class.
      Davide.
        You only need one row per table to be able to log many of the SQL statements.

        And let's try to be serious: which developer writes an application without having a single test data set, even if it's an invented and not very natural one?

        Update: another idea is an incremental approach: run the program once and hope that it executes at least one SQL statement. Then generate some test data for that query, and run it again.

        I just think that "blind" searching in source code for SQL statements will not work reliably; it will produce both many false positives and false negatives.