http://qs1969.pair.com?node_id=1123944

The programmer at wit's end for lack of space can often do best by disentangling himself from his code, rearing back, and contemplating his data. Representation is the essence of programming.

-- from The Mythical Man Month by Fred Brooks

Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

-- Rob Pike

As part of our build and test automation, I recently wrote a short Perl script for our team to automatically build and test specified projects before checkin.

Lamentably, another team had already written a truly horrible Windows .BAT script to do just this. Since I find it intolerable to maintain code in a language lacking subroutines, local variables, and data structures, I naturally started by re-writing it in Perl.

Focusing on data rather than code, it seemed natural to start by defining a table of properties describing what I wanted the script to do. Here is a cut-down version of the data structure I came up with:

# Action functions (return zero on success).
sub find_in_file {
    my $fname  = shift;
    my $str    = shift;
    my $nfound = 0;
    open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
    while ( my $line = <$fh> ) {
        if ( $line =~ /$str/ ) {
            print $line;
            ++$nfound;
        }
    }
    close $fh;
    return $nfound;
}
# ...

# ------------------------------------------------------------------------
# Globals (mostly set by command line arguments)
my $bldtype = 'rel';

# ------------------------------------------------------------------------
# The action table @action_tab below defines the commands/functions
# to be run by this program and the order of running them.
my @action_tab = (
    {
        id      => 'svninfo',
        desc    => 'svn working copy information',
        cmdline => 'svn info',
        workdir => '',
        logfile => 'minbld_svninfo.log',
        tee     => 1,
        prompt  => 0,
        run     => 1,
    },
    {
        id      => 'svnup',
        desc    => 'Run full svn update',
        cmdline => 'svn update',
        workdir => '',
        logfile => 'minbld_svnupdate.log',
        tee     => 1,
        prompt  => 0,
        run     => 1,
    },
    # ...
    {
        id      => "bld",
        desc    => "Build unit tests ${bldtype}",
        cmdline => qq{bldnt ${bldtype}dll UnitTests.sln},
        workdir => '',
        logfile => "minbld_${bldtype}bldunit.log",
        tee     => 0,
        prompt  => 0,
        run     => 1,
    },
    {
        id      => "findbld",
        desc    => 'Call find_strs_in_file',
        fn      => \&find_in_file,
        fnargs  => [ "minbld_${bldtype}bldunit.log", '[1-9][0-9]* errors' ],
        workdir => '',
        logfile => '',
        tee     => 1,
        prompt  => 0,
        run     => 1,
    },
    # ...
);

Generally, I enjoy using property tables like this in Perl. I find them easy to understand, maintain and extend. Plus, a la Pike above, focusing on the data first usually makes the coding itself a snap.

Basically, the program runs a specified series of "actions" (either commands or functions) in the order given by the action table. In the normal case, all actions in the table are run. Command line arguments can be added to select which parts of the table to run. For convenience, I added a -D (dry run) option to simply print the action table, with indices listed, and a -i option to run a specific range of action table indices. A number of further command line options were added over time as we needed them.

Initially, I started with just commands (returning zero on success, non-zero on failure). Later "action functions" were added (again returning zero on success and non-zero on failure).
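The driver for such a table is short. A minimal sketch (hypothetical; the real script adds logging, tee and prompting) of a loop handling both kinds of actions under those return conventions:

```perl
# Run each enabled action in table order; commands and action
# functions both signal success by returning zero.
sub run_actions {
    my @tab = @_;
    for my $act (@tab) {
        next unless $act->{run};
        print "-- $act->{desc}\n";
        my $rc;
        if ( $act->{fn} ) {
            # Action function: call it with its stored arguments.
            $rc = $act->{fn}->( @{ $act->{fnargs} } );
        }
        else {
            # Command: run it and extract the exit status.
            system( $act->{cmdline} );
            $rc = $? >> 8;
        }
        die "action '$act->{id}' failed (rc=$rc)\n" if $rc != 0;
    }
}
```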

As the table grew over time, it became tedious and error-prone to copy and paste table entries. For example, when four different directories needed to be built, rather than adding four entries to the action table that were identical except for the directory name, I wrote a function that took a list of directories and returned the corresponding action table entries. None of this was planned; the script just evolved naturally over time.
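Such a table-generating helper might look something like this (a sketch; the names and properties are illustrative, not from the actual script):

```perl
# Generate one build action per directory, sharing all other properties.
sub make_build_actions {
    my ( $bldtype, @dirs ) = @_;
    return map {
        {
            id      => "bld_$_",
            desc    => "Build $_ ($bldtype)",
            cmdline => qq{bldnt ${bldtype}dll UnitTests.sln},
            workdir => $_,
            logfile => "minbld_${bldtype}_$_.log",
            tee     => 0,
            prompt  => 0,
            run     => 1,
        }
    } @dirs;
}

# Splice the generated entries into the main table, e.g.:
# push @action_tab, make_build_actions( $bldtype, qw(core gui net io) );
```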

Now it is time to take stock, hence this meditation.

Coincidentally, around the same time as I wrote my little script, we inherited an elaborate testing framework that specified tests via XML files. To give you a feel for these, here is a short excerpt:

<Test>
  <Node>Muss</Node>
  <Query>Execute some-command</Query>
  <Valid>True</Valid>
  <MinimumRows>1</MinimumRows>
  <TestColumn>
    <ColumnName>CommandResponse</ColumnName>
    <MatchesRegex row="0">THRESHOLD STARTED.*Taffy</MatchesRegex>
  </TestColumn>
  <TestColumn>
    <ColumnName>CommandExitCode</ColumnName>
    <Compare function="Equal" row="0">0</Compare>
  </TestColumn>
</Test>

Now, while I personally detest using XML for these sorts of files, I felt the intent was good, namely to clearly separate the code from the data, thus allowing non-programmers to add new tests.

Seeing all that XML at first made me feel disgusted ... then uneasy because my action table was embedded in the script rather than more cleanly represented as data in a separate file.

To allow my script to be used by other teams, and by non-programmers, I need to make it easier to specify different action tables without touching the code. So I seek your advice on how to proceed.

Another concern is that when you have thousands of actions, or thousands of tests, a lot of repetition creeps into the data files. Now dealing with repetition (DRY) in a programming language is trivial -- just use a function or a variable, say -- but what is the best way of dealing with unwanted repetition in XML, JSON and YAML data files? Suggestions welcome.
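For YAML specifically, one partial answer is anchors and merge keys: define the shared properties once and merge them into each entry. A sketch (illustrative, not from the actual script; note that the merge key `<<` is a YAML 1.1 feature supported by many loaders but dropped from the YAML 1.2 specification):

```yaml
defaults: &defaults
  workdir: ''
  tee: 1
  prompt: 0
  run: 1

actions:
  - <<: *defaults
    id: svninfo
    desc: svn working copy information
    cmdline: svn info
    logfile: minbld_svninfo.log
  - <<: *defaults
    id: svnup
    desc: Run full svn update
    cmdline: svn update
    logfile: minbld_svnupdate.log
```

XML and JSON have no native equivalent, so repetition there is usually tackled with a preprocessing or templating step before the file is consumed.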

References

Update: I ended up taking BrowserUk's advice and leaving the script alone.

See also: A good way to input data into a script w/o an SQL database (2023) and Data Structure References

Replies are listed 'Best First'.
Re: Data-driven Programming: fun with Perl, JSON, YAML, XML...
by LanX (Saint) on Apr 19, 2015 at 11:47 UTC
    Personally I'd prefer working with a Domain Specific Language for configuration, far more flexible!

    Each property is a sub in your my_config package; action itself is a sub action (&) {...} taking a code block that sets the options.

    It's not only easy to read for non-programmers, but also easy to validate (just add prototypes and do argument checking within the subs).

    package my_config;

    action {
        id      'svninfo';
        desc    'svn working copy information';
        cmdline 'svn info';
        workdir '';
        logfile 'minbld_svninfo.log';
        tee     1;
        prompt  0;
        run     1;
    };

    action {
        id      'svnup';
        desc    'Run full svn update';
        cmdline 'svn update';
        workdir '';
        logfile 'minbld_svnupdate.log';
        tee     1;
        prompt  0;
        run     1;
    };

    # ...

    (update: features which are only boolean could be handled without arguments with useful defaults, i.e. run 1 is simply run and run 0 is skipped)
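    A minimal implementation of such a DSL might look like this (a sketch, not an existing CPAN module; the property setters are generated at compile time so that calls like id 'svninfo'; parse as function calls):

```perl
package my_config;
use strict;
use warnings;

our @actions;    # the collected action table
my %current;     # properties of the entry currently being built

# Generate one setter sub per property at compile time, so bareword
# calls like "id 'svninfo';" compile inside the action blocks.
BEGIN {
    for my $prop (qw(id desc cmdline workdir logfile tee prompt run)) {
        no strict 'refs';
        *{"my_config::$prop"} = sub { $current{$prop} = shift };
    }
}

# action takes a code block, runs it, and records the accumulated entry.
sub action (&) {
    my $block = shift;
    %current = ();
    $block->();
    push @actions, {%current};
}

1;
```

    Prototypes and per-property argument checking (as suggested above) could be layered onto the generated setters.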

    DRY could be achieved with preloaded constants, nested structures or special commands to set defaults like

    default workdir => '';

    # or
    default { workdir '' };

    # or just a top-level setting
    workdir '';

    action {
        # workdir can be missing now
    };

    The fact that you can insert arbitrary Perl code is a mixed bag: it adds extra flexibility, but it can also be a source of bugs with non-hackers. Deactivating builtins in CORE might be an option...

    Dunno if there is already an appropriate module on CPAN (?), otherwise I'd volunteer to produce one for the coming GPW.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

Re: Data-driven Programming: fun with Perl, JSON, YAML, XML...
by hdb (Monsignor) on Apr 19, 2015 at 10:22 UTC

    Looking at your data structure @action_tab, it seems that you need the same properties for each test case. For such a tabular data structure, I would use an Excel sheet for maintenance and also to let others add more test cases (lines) to it. (Excel is on everybody's computer at my workplace anyway.) As there are good Perl modules to read Excel data it is rather easy to read the sheet (even if you yourself are on *nix), do some checks whether the entries look correct and then run your tests. Excel's filters and sorting also let people easily look for existing tests.

Re: Data-driven Programming: fun with Perl, JSON, YAML, XML...
by BrowserUk (Patriarch) on Apr 19, 2015 at 18:15 UTC

    I'm perplexed to understand what you are hoping to achieve here.

    First you say:

    Lamentably, another team had already written a truly horrible Windows .BAT script to do just this. Since I find it intolerable to maintain code in a language lacking subroutines, local variables, and data structures, I naturally started by re-writing it in Perl.

    Great! You're going to replace the severely limited batch language with the power of a proper programming language.

    But then, you decide:

    To allow my script to be used by other teams, and by non-programmers, I need to make it easier to specify different action tables without touching the code.

    And so opt to make your script data driven; with the result that:

    Another concern is that when you have thousands of actions, or thousands of tests, a lot of repetition creeps into the data files. Now dealing with repetition (DRY) in a programming language is trivial -- just use a function or a variable, say -- but what is the best way of dealing with unwanted repetition in XML, JSON and YAML data files?

    And so you come full circle and end up with a mechanism that has less flexibility, power and control than the batch script you started with. The link you provided lists AWK and sed as the other examples of data-driven programming; and Perl was designed to replace them.

    And to what end?

    I find the idea that non-programmers are going to be designing/configuring builds and tests of production application software totally untenable; akin to asking a non-musician to conduct the orchestra rehearsals.

    Seems to me that you've lost sight of what you're trying to achieve and been deceived by a new buzzword for an old idea.

    Even the data formats you've suggested are a) overkill; b) require a programmer's understanding of the format to use. You mentioned liking "table-driven", so why are you looking at hierarchical data formats instead of a simple table of CSV data?
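    Mapping such a CSV table onto the same action hashes takes only a few lines of core Perl. A sketch (assuming no quoted fields or embedded commas; a real script would use a CSV-parsing module such as Text::CSV):

```perl
# Each CSV row becomes one action hash; the header row names the keys.
sub read_action_csv {
    my @lines = @_;
    chomp @lines;
    my @keys = split /,/, shift @lines;    # header row
    my @tab;
    for my $line (@lines) {
        next if $line =~ /^\s*$/;          # skip blank lines
        my %act;
        @act{@keys} = split /,/, $line, scalar @keys;
        push @tab, \%act;
    }
    return @tab;
}
```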

    My take would be to step back. Look at a few examples of actual use cases and look for the simplest solution to solving those use cases; rather than trying to invent something that is all things to all possibilities.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: Data-driven Programming: fun with Perl, JSON, YAML, XML...
by CountZero (Bishop) on Apr 19, 2015 at 16:48 UTC
    Personally, I'm a big fan of YAML: it is expressive, easy to parse, read and write, but none of that is relevant for your question. Just use the format your co-workers are most familiar with, or even use all of these formats. It wouldn't be too difficult to write a routine that can translate each format into your internal data-representation and back again into another format.

    The only formats I wouldn't use are those that are directly evaled, as this raises serious security concerns, or those that put the data into modules, again for security reasons.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Data-driven Programming: fun with Perl, JSON, YAML, XML...
by RichardK (Parson) on Apr 19, 2015 at 11:24 UTC

    One way to reduce the amount of repetition is to take a more OO approach and give each action a type or base action (class) that contains the common parameters. Then, to look up a parameter, first check the action and, if it does not exist there, check the base action. This will make it much easier to add new actions and prevent a lot of cut & paste. Each new action then need only contain the relevant information, not the boring boilerplate stuff.

    e.g. something along these lines :-

    [
        { id => 'test_build',  workdir => 'build/t',  type => 'default_test' },
        { id => 'test_output', workdir => 'output/t', type => 'default_test' },
        ...
        { id => 'default_test', command => 'prove', logfile => 'test.out',
          tee => 1, prompt => 0, run => 1 },
        ...
    ]
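    The fallback lookup itself could be as simple as this sketch (names hypothetical; assumes the actions have been indexed by id):

```perl
# Look up $key in $act; if absent, fall back to its base action,
# found via the 'type' property in the id-indexed table.
sub get_param {
    my ( $actions_by_id, $act, $key ) = @_;
    return $act->{$key} if exists $act->{$key};
    my $base = $act->{type} && $actions_by_id->{ $act->{type} };
    return $base ? $base->{$key} : undef;
}
```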

    My personal preference is to use json for this type of thing, but it is just that: a personal preference. Use whatever you and your users will find most comfortable.

Re: Data-driven Programming: fun with Perl, JSON, YAML, XML...
by LanX (Saint) on Apr 19, 2015 at 12:02 UTC
    > Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

    I disagree that this is generally true.

    For instance tye once made a good point in criticizing Moose objects for centering too much around the attributes they have.

    Objects and classes should be primarily defined by their methods (what they do) and not their attributes (aka data they have)

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!