Recently I became curious whether using MySQL for anything that needs to store data is really the best way. I also decided it was time to learn how to use Benchmark, so the following results may be flawed. The benchmark script itself is fairly long, so due to its size it is not included in the post. Feel free to point out where my method or program may be flawed.
Tests were done on the following machine:
Linux 2.2.6 i686
dual PIII 500 processors, 256MB RAM
MySQL server version 3.22.25
I tested MySQL, Text::CSV_XS, and flatfile (pipe-delimited) storage. The data access methods I tested were insert, select, update, and delete; I'll cover my approach to each below. I am aware that there are ways to optimize flatfile, and perhaps CSV, access by keeping an index of row start positions in another file; the Perl Cookbook shows how to build such an index in recipe 8.8, for the curious. I am sure there are other ways to improve access speeds on flatfile and CSV data, but the intent was to test the "basic" methods as I know them.
Setup: I connected to the MySQL database, created a new Text::CSV_XS object, and defined my CSV file and flatfile. Then I created an array of 200 different "row" hashes. Each hash looked like this, where $_ is the current count (1..200):
{
    ID      => $_,
    name    => "person_$_",
    address => "$_ Any St.",
    city    => "Anytown",
    state   => "AnyPlace",
    zip     => $num
}
Each iteration of Insert, Select, and Update performs the appropriate action on EACH of these 200 items, so 500 iterations actually does 100,000 inserts, selects, or updates. Deletes are only done once per item, since the data is destroyed in the process.
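For reference, the overall harness looked roughly like this (a simplified sketch: the connection details, the table name, and the zip value are illustrative, since the real script is too long to post):

use strict;
use DBI;
use Text::CSV_XS;
use Benchmark;

# Connect to MySQL and set up the CSV object and data files.
# (database name and credentials are placeholders)
my $dbh = DBI->connect('DBI:mysql:test', 'user', 'pass')
    or die $DBI::errstr;
my $csv      = Text::CSV_XS->new;
my $csvfile  = 'data.csv';
my $flatfile = 'data.txt';

# Build the 200 "row" hashes described above.
my @rows;
for (1 .. 200) {
    my $num = sprintf '%05d', $_;    # stand-in for the post's $num
    push @rows, {
        ID      => $_,
        name    => "person_$_",
        address => "$_ Any St.",
        city    => "Anytown",
        state   => "AnyPlace",
        zip     => $num,
    };
}

# Each benchmarked sub works through all 200 rows per iteration.
timethese(500, {
    CSV_Insert      => \&csv_insert,
    Flatfile_Insert => \&flatfile_insert,
    Mysql_Insert    => \&mysql_insert,
});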
Insert: The main item of note for the insert tests was my use of $sth->bind_columns to collect the data; I used this method after reading a node by kudra. For CSV and flatfile, inserts were basically just text appended to the end of the file.
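To make that concrete, here is a simplified sketch of the three insert routines. Note that this version uses plain placeholders rather than the bind_columns call I actually used, and the table name people is made up:

# MySQL: prepared statement with placeholders.
sub mysql_insert {
    my $sth = $dbh->prepare(
        'INSERT INTO people (ID, name, address, city, state, zip)
         VALUES (?, ?, ?, ?, ?, ?)'
    );
    $sth->execute(@{$_}{qw(ID name address city state zip)}) for @rows;
}

# CSV: combine the fields and append the line to the file.
sub csv_insert {
    open my $fh, '>>', $csvfile or die $!;
    for my $row (@rows) {
        $csv->combine(@{$row}{qw(ID name address city state zip)});
        print $fh $csv->string, "\n";
    }
    close $fh;
}

# Flatfile: append a pipe-delimited line.
sub flatfile_insert {
    open my $fh, '>>', $flatfile or die $!;
    print $fh join('|', @{$_}{qw(ID name address city state zip)}), "\n"
        for @rows;
    close $fh;
}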
Insert Results:
Benchmark: timing 500 iterations of CSV_Insert, Flatfile_Insert, Mysql_Insert...
CSV_Insert: 12 wallclock secs (11.30 usr + 0.16 sys = 11.46 CPU) @ 43.63/s (n=500)
Flatfile_Insert: 1 wallclock secs ( 1.00 usr + 0.10 sys = 1.10 CPU) @ 454.55/s (n=500)
Mysql_Insert: 59 wallclock secs (31.21 usr + 3.42 sys = 34.63 CPU) @ 14.44/s (n=500)
Select: For selects on flatfile and CSV I just looped through the lines of the file and compared the "ID" field with the desired "ID" value. All selects were done on the "ID" value, as opposed to other fields.
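Roughly, the flatfile version of that loop looked like this (a sketch; the CSV version is the same loop using $csv->parse on each line instead of split):

# Scan the flatfile for the row whose first field matches the ID.
sub flatfile_select {
    my $want = shift;
    open my $fh, '<', $flatfile or die $!;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\|/, $line;
        if ($fields[0] == $want) {    # "ID" is the first field
            close $fh;
            return @fields;
        }
    }
    close $fh;
    return;
}

The MySQL side is just a keyed query, e.g. $dbh->prepare('SELECT * FROM people WHERE ID = ?') followed by an execute.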
Select Results:
Benchmark: timing 500 iterations of CSV_Select, Flatfile_Select, Mysql_Select...
CSV_Select: 7329 wallclock secs (7260.21 usr + 27.88 sys = 7288.09 CPU) @ 0.07/s (n=500)
Flatfile_Select: 14 wallclock secs (10.54 usr + 3.84 sys = 14.38 CPU) @ 34.77/s (n=500)
Mysql_Select: 143 wallclock secs (35.87 usr + 6.15 sys = 42.02 CPU) @ 11.90/s (n=500)
Update: Updates for flatfile and CSV were done by looping through the lines of the file and copying each line to a .new file, altering the "updated" line before the copy. After the whole file had been processed, the .new file was renamed over the original. Again, the "key" was the "ID" field.
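A sketch of that copy-and-rename approach for the flatfile case (the CSV version differs only in how each line is parsed and rebuilt):

# Rewrite the file line by line, altering the matching row,
# then rename the .new file over the original.
sub flatfile_update {
    my ($id, $new_row) = @_;
    open my $in,  '<', $flatfile       or die $!;
    open my $out, '>', "$flatfile.new" or die $!;
    while (my $line = <$in>) {
        chomp $line;
        my @fields = split /\|/, $line;
        if ($fields[0] == $id) {
            $line = join '|', @{$new_row}{qw(ID name address city state zip)};
        }
        print $out $line, "\n";
    }
    close $in;
    close $out;
    rename "$flatfile.new", $flatfile or die $!;
}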
Update Results:
Benchmark: timing 500 iterations of CSV_Update, Flatfile_Update, Mysql_Update...
CSV_Update: 17312 wallclock secs (16920.40 usr + 135.93 sys = 17056.33 CPU) @ 0.03/s (n=500)
Flatfile_Update: 244 wallclock secs (207.73 usr + 31.91 sys = 239.64 CPU) @ 2.09/s (n=500)
Mysql_Update: 263 wallclock secs ( 8.27 usr + 2.63 sys = 10.90 CPU) @ 45.87/s (n=500)
Delete: Deleting data presented a difficult challenge: how do you delete data for more than one iteration, when the first iteration obviously destroys your dataset? I was unable to find a way to do this that I didn't think would skew the results, so deletes were only done once; the results are therefore likely inaccurate.
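For the file-based stores, one way to implement the delete is the update loop in reverse: rewrite the file, skipping the matching row (a sketch, not my exact code; the MySQL side is simply a DELETE FROM people WHERE ID = ? statement):

# Copy every line except the one being deleted, then rename.
sub flatfile_delete {
    my $id = shift;
    open my $in,  '<', $flatfile       or die $!;
    open my $out, '>', "$flatfile.new" or die $!;
    while (my $line = <$in>) {
        my ($key) = split /\|/, $line;
        print $out $line unless $key == $id;
    }
    close $in;
    close $out;
    rename "$flatfile.new", $flatfile or die $!;
}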
Delete Results:
Benchmark: timing 1 iterations of CSV_Delete, Flatfile_Delete, Mysql_Delete...
CSV_Delete: 16 wallclock secs (14.91 usr + 0.30 sys = 15.21 CPU) @ 0.07/s (n=1)
(warning: too few iterations for a reliable count)
Flatfile_Delete: 0 wallclock secs ( 0.18 usr + 0.04 sys = 0.22 CPU) @ 4.55/s (n=1)
(warning: too few iterations for a reliable count)
Mysql_Delete: 1 wallclock secs ( 0.01 usr + 0.02 sys = 0.03 CPU) @ 33.33/s (n=1)
(warning: too few iterations for a reliable count)
My Conclusions: In all tests flatfile appears to be the winner, with MySQL coming in a close second. CSV appears to be quite slow, so I think I will only use it when I'm forced to deal with someone else's CSV data.
Thank you for your thoughts and suggestions,
James
Edit: Some information suggested by TheoPetersen, which I very much agree with...
MySQL is not a flat file, it is a relational database. It is good at things that relational databases do well, such as retrieving related rows from multiple tables at once.
Flat files are not relational databases. They are good at things such as writing data sequentially to the end of a table or retrieving data with ad-hoc structure.
Don't use flat files when you need a relational database; you'll spend endless time implementing things that you can have for free by using the right data manager. Similarly, avoid making your relational database stand in for a simple flat file unless you will need relational capabilities on the data stored in the file later.