zerohero has asked for the wisdom of the Perl Monks concerning the following question:

I'm currently using Perl to do performance test coordination to benchmark various aspects of a system. The typical cycle is to identify something that you want to generate a scatter graph for (i.e. X/Y plot), and then the items on the X and Y axes. For example "messages per second" on Y, and "processing units" on X. The test coordinator will execute a sequence of tests, incrementing the X value for each test run. Each test run generates a new plot point (X,Y value). This is actually a slight simplification as each test run generates many (X,Y) values, but for different system aspects. For example for a given X we would get both "messages per second" and "cpu utilization of server app".

The desired end product of this process is a nice looking graph somewhere (e.g. web page).

At the end of the test run process I have a bunch of single-line entries in a logfile which are the test results. I then have to go through an annoying, time-consuming process of putting these into MS Excel and making graphs. Obviously there are plenty of much better solutions than this.

Thinking a bit outside the obvious desire to "automate things", it would be nice to have all of the data from these important tests nicely encapsulated and self-describing in a text file. That way it doesn't become a "junk file" on some computer. Another important aspect would be having something that could ingest this data, and given suitable graphing metadata, transform this into a graph, or, in my particular case, transform it into javascript which will render it on a web page (when hooked up to a suitable graph package, like YUI). Thus, I'm looking at this as a classic data representation and transformation problem (for which Perl is perfectly suitable).

My thought was to have a file format that encapsulated a single "test run". The file format would be text, and self-describing. It would have metadata to describe the conditions of the test (e.g. title, hardware, whatever), as well as the data from successive runs. Using this file as input, we would have some sort of additional specification to generate a graph which would have to select the columns and range, as well as handle things like multiple graph plots, etc. The target output rendering would be (*gasp*) javascript, which would be an encapsulated json data object inside a web page presented by some graph package (Yahoo has a nice one that is being developed by their YUI group).

Thus I'm really not necessarily interested in the graphing portion of things (since there are probably many solutions to this old part), but more interested in the middle "transformation" part and the input file formats. The first input file format, as I described, is the test run data container, and the second input file format is the graph metadata container. Of course I'll be using Perl to implement and do all of the transformation work.

Note, I see XML as too heavyweight for such a simple thing.

Now that I've described what I'm after, I'm hoping that some of you will say "hey, that sounds kind of like xyz" (where xyz is the Perl library that does this sort of thing). Alternatively, identifying technology to handle pieces of the whole problem in a nice way is helpful (e.g. use YAML for the file formats so the file formats are generally useful outside of Perl). I could obviously write my own, but am guessing someone has solved this problem in a nice way already.


Replies are listed 'Best First'.
Re: Performance Data and Graphing Metadata File Formats and Transformations
by zerohero (Monk) on Jul 03, 2009 at 13:03 UTC

    OK, I'll answer my own question here. The solution I settled on was to use YAML to specify a variety of simple file formats. Perl ingests these and transforms them into new file formats. The reason for having different types of input files is to separate the data model (performance data) from the view (graph), and to be able to reuse config information (e.g. metric definitions). At the final stage, where we need to take data series and marry them to HTML and javascript, we use the venerable Template Toolkit. The result is a data file which is durable, computer readable, visually pleasing, and self-contained, plus HTML/javascript graphs (or pick some other technology/library), and some reusable metadata files (e.g. metric defs, graph config).

    I'm amazed at how easy this is with YAML/Perl/TT.

    Step one is to take the logfiles which are emitted by the load client as single lines of name=value pairs. This is an extremely useful format for these types of programs which tend to be C or C++ since it doesn't require much to output this format. Emitting things as YAML at this stage is, in my opinion, overkill. The "single-line, name-value pair metric" format is something that is a pretty easy pill to swallow, even for very high performance programs where we don't want to introduce too much of a burden.
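
    For illustration, here's a minimal sketch of slurping that log format into a hash of per-metric arrays (the metric names and the whitespace-separated layout are just examples, not the exact log format):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Collect "single-line, name=value pair" log entries into a hash of
        # arrays, one array per metric, in test-run order. Metric names and
        # the whitespace-separated layout are illustrative.
        my %metrics;
        while (my $line = <>) {
            chomp $line;
            next unless $line =~ /=/;    # skip lines that aren't result records
            my %pairs = map { split /=/, $_, 2 } split ' ', $line;
            push @{ $metrics{$_} }, $pairs{$_} for keys %pairs;
        }

        # %metrics might now look like:
        #   processing_units => [1, 2, 4, 8],
        #   msgs_per_sec     => [1200, 2300, 4100, 7500],
        #   cpu_util_pct     => [12, 21, 40, 77],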

    The next step is to produce what I call a "yaml-gram". A yaml-gram is a "self-contained, visually pleasing, data-oriented instance of a typed YAML document". In my case, this was performance data from load tests on a messaging system, so the type of the document was PTD (performance test data), which has a simple schema. In order to make the data "self-contained" I found it useful to include not only the metric data, but also the _meanings_ of the metrics, their units, and the test setups. This is necessary so that the data can be properly reused: if you don't know the meaning of a metric, or there is some ambiguity, it becomes junk data (I found that providing the unit of measure and a description, keyed to the metric name, were sufficient). In addition, the test setups should be specified such that they can be used as input to drive the tests again. Note that the description of the metrics stays the same across large numbers of tests, and therefore becomes its own YAML file, which gets married to the data file to produce the final PTD YAML file. In addition there's a TT template for any boilerplate for this particular test series (e.g. software versions used). Ingesting these various file types (logfile, metric meta file, TT file) and operating on them is so simple with YAML, it's fun.
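
    Putting that together is something like this (file and key names are simplified for the example, not the real PTD schema):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use YAML qw(LoadFile DumpFile);

        # Combine the reusable metric definitions (units, descriptions) and the
        # test setup with the parsed metric data into one self-contained PTD
        # document. File and key names here are illustrative.
        my $defs  = LoadFile('metric_defs.yaml');   # e.g. msgs_per_sec: { unit: 'msg/s', desc: '...' }
        my $setup = LoadFile('test_setup.yaml');    # hardware, software versions, step values, ...
        my %metrics;                                # built from the logfile as in the sketch above

        my $ptd = {
            type        => 'PTD',
            setup       => $setup,
            definitions => $defs,
            data        => \%metrics,
        };

        # DumpFile is the quick way out; see the note on formatting below.
        DumpFile('run_2009-07-03.ptd.yaml', $ptd);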

    One thing I found challenging was Dumping the YAML file. Sure, Dump() "works", but it doesn't render things in a very pleasing format. Since YAML is designed to be human readable, I found it necessary to concentrate on the human-readable aspects of the file to make it visually pleasing (there is probably a Perl lib that does this). I wound up writing a little formatter which takes the metrics (a hash of single-dimensional arrays of numbers) and block-formats them so that they have consistent padding/spacing. This makes the file really readable. At this stage we have a computer-readable, self-contained (much of the data needed to understand and use the metrics is in the same document), visually pleasing yaml-gram (PTD). All the pieces of it are very useful, and we likely won't have to change this format, since data models are usually fairly straightforward. I did find that I organized the data _much differently_ than I would have had I used XML. This was because I concentrated on human readability (thus the data, descriptions and units are their own sections, rather than co-locating each under a single metric).
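
    The formatter idea, boiled down (not the exact code, just the gist of padding names and values to common widths and emitting YAML flow sequences):

        # Block-format a hash of metric arrays (name => [numbers]) as a YAML
        # "data" section with consistent column widths.
        sub format_metrics {
            my ($metrics) = @_;

            # widest metric name and widest value decide the column widths
            my ($name_w) = sort { $b <=> $a } map { length } keys %$metrics;
            my ($val_w)  = sort { $b <=> $a } map { length } map { @$_ } values %$metrics;

            my $out = "data:\n";
            for my $name (sort keys %$metrics) {
                $out .= sprintf "  %-*s [ %s ]\n", $name_w + 1, "$name:",
                    join ', ', map { sprintf '%*s', $val_w, $_ } @{ $metrics->{$name} };
            }
            return $out;
        }

        # yields something like:
        #   data:
        #     msgs_per_sec:      [ 1200, 2300, 4100, 7500 ]
        #     processing_units:  [    1,    2,    4,    8 ]

    Flow sequences like these are still plain YAML, so Load() reads the pretty output back exactly as it would the default Dump() output.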

    Next we take a YAML input file which provides the information needed to make an abstract scatter graph. We indicate a "common" series (the X values, which will be shared); this is typically the step value in the tests. We can also indicate several "Y" series (which will share the X coordinates). The metrics are given as one-dimensional arrays, so you combine each Y series with the common series to get an array of coordinate pairs. We add other details like labels at this point. So this stage requires reading our PTD file and this graph file, selecting the pieces, and putting them together into a single hash to pass to Template Toolkit. TT then takes this and renders an HTML file with javascript graph display code. I used a package called Flot, which is pure javascript and claims to be cross-browser. I just took one of the example files and did "cut and paste" programming with TT. That is, I found the place where I needed an array of numbers and replaced it with a reference to my variables. This part was extremely quick and I didn't really have to learn too much about the graph package, which was my intent.
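
    In outline it looks like this (file names and config keys are simplified for the example; graph.tt is just a Flot example page with TT tags pasted in where the data array used to be):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use YAML qw(LoadFile);
        use Template;

        # Select the shared X series and the requested Y series from the PTD
        # file, zip them into [x, y] pairs, and hand them to Template Toolkit.
        my $ptd   = LoadFile('run_2009-07-03.ptd.yaml');
        my $graph = LoadFile('graph_config.yaml');  # e.g. common: processing_units, series: [msgs_per_sec]

        my $x = $ptd->{data}{ $graph->{common} };
        my @series;
        for my $name (@{ $graph->{series} }) {
            my $y = $ptd->{data}{$name};
            push @series, {
                label   => $name,
                # pre-render the Flot data array as a javascript literal
                data_js => '[' . join(',', map { "[$x->[$_],$y->[$_]]" } 0 .. $#$x) . ']',
            };
        }

        # graph.tt is a copy of a Flot example page whose hard-coded data array
        # was replaced with something like:
        #   $.plot($("#placeholder"), [
        #     [% FOREACH s IN series %]{ label: "[% s.label %]", data: [% s.data_js %] }[% UNLESS loop.last %],[% END %][% END %]
        #   ]);
        my $tt = Template->new;
        $tt->process('graph.tt', { series => \@series }, 'graph.html')
            or die $tt->error;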

    The result was extremely pleasing: auto-generated, interactive javascript graphs. The Perl command-line transformation tools which do each step may of course be chained together to form a pipeline (another thing that's wonderfully simple using Perl and Getopt::Long). Further, by focusing on the file formats and data transformation, the problem is solved in a general way that doesn't tie me to a particular rendering solution. I'm very impressed with the Perl/YAML/TT combination.
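
    A skeleton of one of those command-line stages (option names are just examples):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Getopt::Long;
        use YAML qw(LoadFile DumpFile);

        # One stage of the pipeline: read a PTD file and a graph config,
        # transform, and write the result where the next stage can pick it up.
        my ($ptd_file, $graph_file, $out_file);
        GetOptions(
            'ptd=s'   => \$ptd_file,
            'graph=s' => \$graph_file,
            'out=s'   => \$out_file,
        ) && $ptd_file && $graph_file && $out_file
            or die "usage: $0 --ptd FILE --graph FILE --out FILE\n";

        my $ptd   = LoadFile($ptd_file);
        my $graph = LoadFile($graph_file);

        # ... select and combine as in the earlier sketches ...
        my $result = { ptd => $ptd, graph => $graph };   # placeholder transformation
        DumpFile($out_file, $result);

    Because each stage reads and writes plain files, the tools compose with shell scripts, make, or each other without any extra glue.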