Well, there are some problems with your benchmark. I'm going to ignore CSV since its obviously the slowest.

First, your method of inserting into the mysql DB is the SLOWEST of the three options:

Benchmark: timing 500 iterations of dbh_insert, dbh_insert_values, sth +_insert... dbh_insert: 28 wallclock secs ( 5.99 usr + 1.68 sys = 7.67 CPU) @ 65 +.19/s (n=500) dbh_insert_values: 27 wallclock secs ( 6.02 usr + 1.82 sys = 7.84 CP +U) @ 63.78/s (n=500) sth_insert: 59 wallclock secs (28.48 usr + 2.41 sys = 30.89 CPU) @ 16 +.19/s (n=500)

I used your database format with a small change of adding an index on ID which slows down the inserts.

For those curious, here are the BM stats of the three insert formats without an index on ID.

Benchmark: timing 500 iterations of dbh_insert, dbh_insert_values, sth +_insert... dbh_insert: 20 wallclock secs ( 5.30 usr + 1.41 sys = 6.71 CPU) @ 74 +.52/s (n=500) dbh_insert_values: 23 wallclock secs ( 5.62 usr + 1.45 sys = 7.07 CP +U) @ 70.72/s (n=500) sth_insert: 55 wallclock secs (27.53 usr + 2.50 sys = 30.03 CPU) @ 16 +.65/s (n=500)

Second problem is the actual import subroutines.

Your import routine is skewed in favor of flatfile since it is simulating a mass import, not periodic imports.

Your code:

sub insert_flatfile { # Note: files opened in > mode again to prevent disk space iss +ues open(FF, ">$flatfile") || die "Unable to open '$flatfile': $!" +; foreach(@rows) { print FF join('|', ($_->{ID}, $_->{name}, $_->{address +}, $_->{city}, $_->{state}, $_->{zip})), "\n"; } close(FF); }

Now, to simulate a production enviorment it should be more like this:

sub insert_flatfile { # Note: files opened in > mode again to prevent disk space iss +ues open(FF, ">$flatfile") || die "Unable to open '$flatfile': $!" +; close(FF); foreach(@rows) { open(FF, ">>$flatfile") || die "Unable to open '$flatf +ile': $!"; flock(FF,LOCK_EX); print FF join('|', ($_->{ID}, $_->{name}, $_->{address +}, $_->{city}, $_->{state}, $_->{zip})), "\n"; flock(FF,LOCK_UN); close(FF); } }

Note: There is still a problem with this routine since in a production enviorment, you would either have to read the flatfile to find out the next ID to store ( or another file where you store the next id ) while MySQL could use an auto_increment on the ID column. But for this benchmarks needs its really overkill.

Note2:If simulating the mass importation of data, MySQL can be greatly speed up using

insert into table (blah,blah2,blah3) values ('vbla1','vbla2','vbla3'), ('vbla1b','vbla2b','vbla3b'), ect...

Note3: Depending on how the data is used.. weither it would need to be instantly retrieved or not insert delayed can be used which makes the inserting APPEAR to be near instantanious.

Here are my results for the benchmarks using my 2 changes WITHOUT an index on ID

Benchmark: timing 500 iterations of Flatfile_Insert, Mysql_Insert... Flatfile_Insert: 8 wallclock secs ( 4.29 usr + 3.30 sys = 7.59 CPU) + @ 65.88/s (n=500) Mysql_Insert: 24 wallclock secs ( 5.32 usr + 1.30 sys = 6.62 CPU) @ +75.53/s (n=500) ======================================== Benchmark: timing 500 iterations of Flatfile_Select, Mysql_Select... Flatfile_Select: 15 wallclock secs ( 7.99 usr + 5.81 sys = 13.80 CPU) + @ 36.23/s (n=500) Mysql_Select: 127 wallclock secs (35.89 usr + 4.20 sys = 40.09 CPU) @ + 12.47/s (n=500) ======================================== Benchmark: timing 500 iterations of Flatfile_Update, Mysql_Update... Flatfile_Update: 175 wallclock secs (146.59 usr + 24.77 sys = 171.36 C +PU) @ 2.92/s (n=500) Mysql_Update: 189 wallclock secs (10.51 usr + 1.92 sys = 12.43 CPU) @ + 40.23/s (n=500) ======================================== We can only do 1 iteration of the delete because it necessarily destro +ys our datasource. Benchmark: timing 1 iterations of Flatfile_Delete, Mysql_Delete... Flatfile_Delete: 0 wallclock secs ( 0.17 usr + 0.01 sys = 0.18 CPU) + @ 5.56/s (n=1) (warning: too few iterations for a reliable count) Mysql_Delete: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU) (warning: too few iterations for a reliable count)

And now with an index on ID:

Benchmark: timing 500 iterations of Flatfile_Insert, Mysql_Insert... Flatfile_Insert: 8 wallclock secs ( 4.02 usr + 3.36 sys = 7.38 CPU) + @ 67.75/s (n=500) Mysql_Insert: 26 wallclock secs ( 5.74 usr + 1.93 sys = 7.67 CPU) @ +65.19/s (n=500) ======================================== Benchmark: timing 500 iterations of Flatfile_Select, Mysql_Select... Flatfile_Select: 14 wallclock secs ( 8.25 usr + 5.39 sys = 13.64 CPU) + @ 36.66/s (n=500) Mysql_Select: 85 wallclock secs (34.61 usr + 5.33 sys = 39.94 CPU) @ +12.52/s (n=500) ======================================== Benchmark: timing 500 iterations of Flatfile_Update, Mysql_Update... Flatfile_Update: 179 wallclock secs (147.56 usr + 25.98 sys = 173.54 C +PU) @ 2.88/s (n=500) Mysql_Update: 37 wallclock secs (10.73 usr + 2.33 sys = 13.06 CPU) @ +38.28/s (n=500) ======================================== We can only do 1 iteration of the delete because it necessarily destro +ys our datasource. Benchmark: timing 1 iterations of Flatfile_Delete, Mysql_Delete... Flatfile_Delete: 0 wallclock secs ( 0.13 usr + 0.04 sys = 0.17 CPU) + @ 5.88/s (n=1) (warning: too few iterations for a reliable count) Mysql_Delete: 0 wallclock secs ( 0.01 usr + 0.01 sys = 0.02 CPU) @ +50.00/s (n=1) (warning: too few iterations for a reliable count)


EDIT: I was looking at it some more and I decided I didnt think the select routine was the best way of testing either.

Basically your selecting each row once , just in a random order. A better way to simulate a production enviorment would be to crank up the iterations ( $number_of_rows * 500 for example ) and have the subroutine pick a random row to select from. $a = int(rand($number_of_rows))


Pete
insert into pete values('red hair','near green eyes','overinflated ego');


In reply to Re: Mysql, CSV, and flatfiles, a benchmark. by petethered
in thread Mysql, CSV, and flatfiles, a benchmark. by illitrit

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.