Data-driven Programming: fun with Perl, JSON, YAML, XML...

The programmer at wit's end for lack of space can often do best by disentangling himself from his code, rearing back, and contemplating his data. Representation is the essence of programming.

-- from The Mythical Man Month by Fred Brooks

Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

-- Rob Pike

As part of our build and test automation, I recently wrote a short Perl script for our team to automatically build and test specified projects before checkin.

Lamentably, another team had already written a truly horrible Windows .BAT script to do just this. Since I find it intolerable to maintain code in a language lacking subroutines, local variables, and data structures, I naturally started by re-writing it in Perl.

Focusing on data rather than code, it seemed natural to start by defining a table of properties describing what I wanted the script to do. Here is a cut-down version of the data structure I came up with:

# Action functions (return zero on success).

sub find_in_file
{
   my $fname  = shift;
   my $str    = shift;
   my $nfound = 0;
   open( my $fh, '<', $fname ) or die "error: open '$fname': $!";
   while ( my $line = <$fh> ) {
      if ( $line =~ /$str/ ) {
         print $line;
         ++$nfound;
      }
   }
   close $fh;
   return $nfound;
}

# ...

# ------------------------------------------------------------------------
# Globals (mostly set by command line arguments)

my $bldtype = 'rel';

# ------------------------------------------------------------------------
# The action table @action_tab below defines the commands/functions
# to be run by this program and the order of running them.

my @action_tab = (
   {
      id      => 'svninfo',
      desc    => 'svn working copy information',
      cmdline => 'svn info',
      workdir => '',
      logfile => 'minbld_svninfo.log',
      tee     => 1,
      prompt  => 0,
      run     => 1,
   },
   {
      id      => 'svnup',
      desc    => 'Run full svn update',
      cmdline => 'svn update',
      workdir => '',
      logfile => 'minbld_svnupdate.log',
      tee     => 1,
      prompt  => 0,
      run     => 1,
   },
   # ...
   {
      id      => "bld",
      desc    => "Build unit tests ${bldtype}",
      cmdline => qq{bldnt ${bldtype}dll UnitTests.sln},
      workdir => '',
      logfile => "minbld_${bldtype}bldunit.log",
      tee     => 0,
      prompt  => 0,
      run     => 1,
   },
   {
      id      => "findbld",
      desc    => 'Call find_in_file',
      fn      => \&find_in_file,
      fnargs  => [ "minbld_${bldtype}bldunit.log", '[1-9][0-9]* errors' ],
      workdir => '',
      logfile => '',
      tee     => 1,
      prompt  => 0,
      run     => 1,
   }
   # ...
);

Generally, I enjoy using property tables like this in Perl. I find them easy to understand, maintain and extend. Plus, a la Pike above, focusing on the data first usually makes the coding a snap.

Basically, the program runs a specified series of "actions" (either commands or functions) in the order specified by the action table. In the normal case, all actions in the table are run. Command line arguments can be used to select which parts of the table to run. For convenience, I added a -D (dry run) option to simply print the action table, with indices listed, and a -i option to allow a specific range of action table indices to be run. A number of further command line options were added over time as we needed them.
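A minimal sketch of such a driver loop, not the actual script. The name run_actions, the option keys dry_run/first/last (stand-ins for -D and -i), the returned hash, and the choice to stop at the first failing action are all assumptions made for illustration. Handling of workdir, logfile, tee and prompt is omitted.

```perl
use strict;
use warnings;

# Hypothetical driver: walk the table in order, honouring the 'run' flag.
# Assumption: stop at the first action that returns non-zero.
sub run_actions {
    my ( $tab, %opt ) = @_;
    my $first = $opt{first} // 0;
    my $last  = $opt{last}  // $#{$tab};
    my %result = ( ran => [], failed => undef );

    for my $i ( $first .. $last ) {
        my $act = $tab->[$i];
        next unless $act->{run};
        if ( $opt{dry_run} ) {
            # Dry run: list the entry with its index, execute nothing.
            printf "%2d: %-10s %s\n", $i, $act->{id}, $act->{desc};
            next;
        }
        # An entry is either an action function or an external command.
        my $rc = $act->{fn}
            ? $act->{fn}->( @{ $act->{fnargs} || [] } )
            : system( $act->{cmdline} );
        push @{ $result{ran} }, $act->{id};
        if ( $rc != 0 ) {
            $result{failed} = $act->{id};
            last;
        }
    }
    return \%result;
}

# Demo table using only functions, so nothing external is executed.
my @demo_tab = (
    { id => 'ok1',  desc => 'always succeeds', fn => sub { 0 }, run => 1 },
    { id => 'skip', desc => 'disabled entry',  fn => sub { 1 }, run => 0 },
    { id => 'bad',  desc => 'always fails',    fn => sub { 1 }, run => 1 },
    { id => 'ok2',  desc => 'never reached',   fn => sub { 0 }, run => 1 },
);

run_actions( \@demo_tab, dry_run => 1 );
my $res = run_actions( \@demo_tab );
print "ran: @{ $res->{ran} }, failed: ", $res->{failed} // 'none', "\n";
```

Because commands and functions share the "zero means success" convention, the loop needs only one failure check for both kinds of action.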

Initially, I started with just commands (returning zero on success, non-zero on failure). Later "action functions" were added (again returning zero on success and non-zero on failure).

As the table grew over time, it became tedious and error-prone to copy and paste table entries. For example, if there are four different directories to be built, rather than having four entries in the action table that are identical except for the directory name, I wrote a function that took a list of directories and returned an action table. None of this was planned; the script just evolved naturally over time.
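A sketch of what such a generator function might look like. The name make_build_actions, the id and logfile naming scheme, and the directory names in the demo are hypothetical. The cmdline is copied from the 'bld' entry above, on the assumption that only the working directory varies.

```perl
use strict;
use warnings;

# Hypothetical generator: one build action per directory.
# Entries differ only in id, desc, workdir and logfile.
sub make_build_actions {
    my ( $bldtype, @dirs ) = @_;
    return map {
        my $dir = $_;
        +{
            id      => "bld_$dir",
            desc    => "Build $dir ($bldtype)",
            cmdline => "bldnt ${bldtype}dll UnitTests.sln",
            workdir => $dir,
            logfile => "minbld_${bldtype}_$dir.log",
            tee     => 0,
            prompt  => 0,
            run     => 1,
        };
    } @dirs;
}

# Illustrative directory names only.
my @build_tab = make_build_actions( 'rel', qw(core net gui tools) );
print "$_->{id}: workdir=$_->{workdir} logfile=$_->{logfile}\n" for @build_tab;
```

The generated entries can be spliced into the main table like any hand-written ones, so the rest of the script does not need to know they were generated.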

Now is the time to take stock, hence this meditation.

Coincidentally, around the same time as I wrote my little script, we inherited an elaborate testing framework that specified tests via XML files. To give you a feel for these, here is a short excerpt:

<Test>
   <Node>Muss</Node>
   <Query>Execute some-command</Query>
   <Valid>True</Valid>
   <MinimumRows>1</MinimumRows>
   <TestColumn>
      <ColumnName>CommandResponse</ColumnName>
      <MatchesRegex row="0">THRESHOLD STARTED.*Taffy</MatchesRegex>
   </TestColumn>
   <TestColumn>
      <ColumnName>CommandExitCode</ColumnName>
      <Compare function="Equal" row="0">0</Compare>
   </TestColumn>
</Test>

Now, while I personally detest using XML for these sorts of files, I felt the intent was good, namely to clearly separate the code from the data, thus allowing non-programmers to add new tests.

Seeing all that XML at first made me feel disgusted ... then uneasy because my action table was embedded in the script rather than more cleanly represented as data in a separate file.

To allow my script to be used by other teams, and by non-programmers, I need to make it easier to specify different action tables without touching the code. So I seek your advice on how to proceed:

Encode the action table as an XML file.
Encode the action table as a YAML file.
Encode the action table as a JSON (JavaScript Object Notation) file.
Encode the action table as a "Perl Object Notation" file (and read/parse via string eval).
Turn the script and action table/s into Perl module/s.
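To make the JSON option concrete, here is a rough sketch using the core JSON::PP module. Since JSON cannot hold code references, the sketch assumes that action functions are named as strings in the file and resolved through a whitelist hash. The names %fn_for and load_action_tab are hypothetical, and find_in_file is stubbed out here.

```perl
use strict;
use warnings;
use JSON::PP;    # in the Perl core since 5.14

# Only functions listed here may be named from the data file.
my %fn_for = (
    find_in_file => sub { my ( $fname, $str ) = @_; return 0 },    # stub
);

# Decode the JSON text and swap function names for code references.
sub load_action_tab {
    my ($json_text) = @_;
    my $tab = decode_json($json_text);
    for my $act (@$tab) {
        next unless exists $act->{fn};
        my $name = $act->{fn};
        $act->{fn} = $fn_for{$name}
            or die "error: unknown action function '$name'";
    }
    return $tab;
}

my $json_text = <<'END_JSON';
[
  { "id": "svninfo", "desc": "svn working copy information",
    "cmdline": "svn info", "workdir": "", "logfile": "minbld_svninfo.log",
    "tee": 1, "prompt": 0, "run": 1 },
  { "id": "findbld", "desc": "Scan build log for errors",
    "fn": "find_in_file",
    "fnargs": [ "minbld_relbldunit.log", "[1-9][0-9]* errors" ],
    "workdir": "", "logfile": "", "tee": 1, "prompt": 0, "run": 1 }
]
END_JSON

my $tab = load_action_tab($json_text);
print scalar(@$tab), " actions loaded\n";
```

Note what is lost compared with the embedded table: values such as ${bldtype} are no longer interpolated, so the build type is hard-coded in the file above. A separate file would need some substitution step of its own to get that back.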

Another concern is that when you have thousands of actions, or thousands of tests, a lot of repetition creeps into the data files. Now dealing with repetition (DRY) in a programming language is trivial -- just use a function or a variable, say -- but what is the best way of dealing with unwanted repetition in XML, JSON and YAML data files? Suggestions welcome.
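One possible way to reduce repetition in a data file, sketched here under assumptions: keep the shared properties in a "defaults" section and let each entry list only what differs. The layout with "defaults" and "actions" keys and the function expand_defaults are inventions for this sketch, not an existing convention. YAML offers anchors and aliases for a similar purpose within the file itself.

```perl
use strict;
use warnings;
use JSON::PP;    # in the Perl core since 5.14

# Hypothetical helper: copy defaults into each entry; entry keys win.
sub expand_defaults {
    my ($spec) = @_;
    my $defaults = $spec->{defaults} || {};
    return map {
        my %merged = ( %$defaults, %$_ );
        \%merged;
    } @{ $spec->{actions} };
}

my $doc = decode_json(<<'END_JSON');
{
  "defaults": { "workdir": "", "tee": 1, "prompt": 0, "run": 1 },
  "actions": [
    { "id": "svninfo", "cmdline": "svn info",   "logfile": "minbld_svninfo.log" },
    { "id": "svnup",   "cmdline": "svn update", "logfile": "minbld_svnupdate.log" },
    { "id": "bld",     "cmdline": "bldnt reldll UnitTests.sln", "tee": 0 }
  ]
}
END_JSON

my @merged_tab = expand_defaults($doc);
printf "%-8s tee=%d run=%d\n", @{$_}{qw(id tee run)} for @merged_tab;
```

This removes only one kind of repetition (identical property values). Entries that differ by a single substituted name, such as the four-directory case above, still need some form of generation or templating.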

References

Data-driven programming (wikipedia)
Data-driven programming (The Art of Unix Programming)
Data-driven Programs (c2.com)
JSON (wikipedia)
YAML (wikipedia)
YAML Spec (yaml.org)
XSLT (wikipedia)
DRY (wikipedia)

Update: I ended up taking BrowserUk's advice and leaving the script alone.
