It is a good idea to create your own data, as you cannot rely only on the data you get from the spam filter. When you test, you want to cover as many different cases as possible within a limited amount of time, and there is no guarantee you would be able to gather enough data from the spam filter, so you have to create more cases yourself. You want your program to cover not just what has happened, but what could happen.
But at the same time, you should also take samples from what the spam filter created.
So what I am saying is that you should use two types of test data: the real ones from the spam filter, and the ones you created to fill the holes.
One more thing you want to do is keep all the test data, along with the expected result for each case. If you have time (you don't always have time, that's always a problem, especially when you have a deadline ;-), create a small tool for yourself that runs each test case automatically, compares the actual result with the expected result, and produces a report telling you which cases passed and which failed. This is an investment for the future.
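That "small tool" does not have to be fancy. Here is a minimal sketch in Perl; check_message() and the t/data/*.msg paths are invented placeholders, so substitute whatever your real filter wrapper and saved messages are called:

#!/usr/bin/perl
use strict;
use warnings;

# check_message() is a stand-in for whatever actually runs your filter
# on one saved message; replace the dummy logic with a call into your code.
sub check_message {
    my ($file) = @_;
    return $file =~ /spam/ ? 'spam' : 'ham';   # dummy logic for the sketch
}

# Each case: a name, a saved input message, and the expected verdict.
my @cases = (
    { name => 'plain newsletter', input => 't/data/newsletter.msg', expect => 'ham'  },
    { name => 'obvious spam',     input => 't/data/spam-pills.msg', expect => 'spam' },
    { name => 'empty body',       input => 't/data/empty.msg',      expect => 'ham'  },
);

my ($pass, $fail) = (0, 0);
for my $case (@cases) {
    my $got = check_message( $case->{input} );
    if ( $got eq $case->{expect} ) {
        $pass++;
        print "PASS: $case->{name}\n";
    }
    else {
        $fail++;
        print "FAIL: $case->{name} (expected $case->{expect}, got $got)\n";
    }
}
print "\n$pass passed, $fail failed\n";

Once something like this exists, rerunning every old case after each change costs you nothing.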
Don't assume you will never modify your program later, and don't assume there are no bugs. The test data you create now is a very precious resource: keep it, and reuse it in the future. | [reply] |
I have done this type of testing, and I only know of two approaches. One is to keep a special database for testing. The other is to continuously fix the tests as the database changes.
Depending on the nature of your tests, you may need to do some of each. To test new data, keep the code constant. To test new code, keep the database constant. The trick is to test one dimension at a time: don't test new code and new data at the same time.
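For example, the "special database for testing" can be a tiny fixture you rebuild from scratch before each run, so the data dimension stays fixed while the code changes. A rough sketch using DBI with DBD::SQLite (the file name, schema, and verdicts are all invented for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Rebuild a small fixture database with known contents on every run.
my $dbfile = 'test_corpus.db';
unlink $dbfile;   # start from a clean slate

my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT, verdict TEXT)');

my $sth = $dbh->prepare('INSERT INTO messages (subject, verdict) VALUES (?, ?)');
$sth->execute('Meeting minutes for Tuesday', 'ham');
$sth->execute('CHEAP PILLS NOW!!!',          'spam');
$sth->execute('',                            'ham');   # boundary case

$dbh->disconnect;
print "Wrote fixture database $dbfile\n";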
If it is impractical to keep a test database, you may be able to separate the code changes from the data changes using statistics. I use R for this type of statistics.
It should work perfectly the first time! - toma | [reply] |
To follow on from what pg has mentioned, what you're after is control. You need to build your own data and seed it with the results you expect to see for each type of test case: positive, negative, and any boundary conditions that may apply. As you know what the data is, you can safely expect to see it in your output.
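In Perl that kind of seeded case table maps naturally onto Test::More. A small sketch, where classify() and its threshold are made up purely to have something to test against:

use strict;
use warnings;
use Test::More;

# classify() stands in for whatever routine you are really testing.
sub classify {
    my ($score) = @_;
    return $score >= 5 ? 'spam' : 'ham';
}

# One table of cases: positive, negative, and boundary conditions,
# each seeded with the result we expect to see.
my @cases = (
    [ 'clearly spam (positive)',  9, 'spam' ],
    [ 'clearly ham (negative)',   0, 'ham'  ],
    [ 'just below the threshold', 4, 'ham'  ],
    [ 'exactly at the threshold', 5, 'spam' ],
);

for my $case (@cases) {
    my ($name, $score, $expected) = @$case;
    is( classify($score), $expected, $name );
}

done_testing();

Because the expected verdicts live next to the inputs, a failing case tells you immediately which condition broke.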
After the unit tests are successful (and documented, being careful to retain the data and to make sure all the tests are reproducible), it is generally a good idea to grab some production data and perform another series of tests...
From here you can submit your system to UAT (User Acceptance Testing) and final QA and sign-off by the owners of the process/procedure or whatever.
Depending on your environment, unit testing that is well documented may be enough; however, keep in mind that your goal is to prove your program works as advertised. The required level of proof will differ from site to site, but IMO one should at the very minimum have fully documented test cases at the unit-test level.
Also having someone test your system who has not been involved in the design/coding is generally a good idea as they bring a fresh set of eyes and ideas that you may not have considered. | [reply] |
Thanks to Ryszard, pg, and toma for your replies.
Considering these modules are being created solely for my own consumption, I think the UAT, QA, and final sign-off might not necessarily be appropriate :) ... but that's fine advice for any commercial endeavours, and I thank you for it.
These tests were really intended as a sort of soft introduction to testing based on some simple code I'd just recently written. With that in mind, I think a simple test .db and basic unit tests will probably suffice for now.
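For the record, the "simple test .db plus basic unit tests" I have in mind is roughly this shape; it assumes a fixture database like the one sketched above, and the table and column names are the same invented ones:

# t/basic.t -- minimal unit tests against a small test .db
use strict;
use warnings;
use DBI;
use Test::More;

my $dbfile = 'test_corpus.db';
my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", '', '', { RaiseError => 1 });

# Sanity-check the fixture itself before testing anything built on it.
my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM messages');
is( $count, 3, 'fixture database holds the expected number of messages' );

my ($verdict) = $dbh->selectrow_array(
    'SELECT verdict FROM messages WHERE subject = ?', undef, 'CHEAP PILLS NOW!!!' );
is( $verdict, 'spam', 'known spam message is stored with a spam verdict' );

$dbh->disconnect;
done_testing();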
| [reply] |