Mysql, CSV, and flatfiles, a benchmark.

by illitrit (Friar)
on May 07, 2001 at 19:00 UTC ( [id://78517] )

Recently I became curious whether using Mysql is really the best choice for anything that needs to store data. I also decided it was time to learn how to use Benchmark, so the following results may be flawed.

The program used to do the benchmarking can be found here; due to its size it is not included in the post. Feel free to point out where my method or program may be flawed.

Tests were done on the following machine:
Linux 2.2.6 i686
dual PIII 500 processors, 256MB Ram
Mysql server version 3.22.25

I tested Mysql, Text::CSV_XS, and flatfile (pipe delimited) storage. The data access methods I tested were insert, select, update, and delete. I'll cover my approach for each below. I am aware that there are ways to optimize flatfile and perhaps CSV access by keeping an index of the row start positions in another file; The Perl Cookbook shows a way to create such an index in recipe 8.8, for the curious. I am sure there are other ways to improve access speeds on flatfile and CSV data; however, the intent was to test the "basic" methods as I know them.
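(Just to illustrate the Cookbook idea mentioned above, and not something used in these benchmarks: a rough sketch of a byte-offset index over the pipe-delimited file. The lookup of ID 42 at the end is a made-up example.)

    # Build an in-memory index of line start offsets, keyed by the ID
    # (first pipe-delimited field) of each row.
    my %offset;
    open(FF, "<$flatfile") || die "Unable to open '$flatfile': $!";
    my $pos = tell(FF);
    while (<FF>) {
        my ($id) = split /\|/;
        $offset{$id} = $pos;
        $pos = tell(FF);
    }

    # Later, jump straight to a row instead of scanning the whole file.
    seek(FF, $offset{42}, 0) || die "seek failed: $!";
    my $row = <FF>;
    close(FF);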

Setup: I connected to the Mysql database, created a new Text::CSV_XS object, and defined my csv file and flatfile. Then I created an array containing 200 different "row" hashes. Each hash looked like this, where $_ is the current count (1..200):
{ ID => $_, name => "person_$_", address => "$_ Any St.", city => "Anytown", state => "AnyPlace", zip => $num }
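(For context, a minimal sketch of how such a list of row hashes might be built; the real script is linked above, and the five-digit zip here is just a made-up value standing in for $num.)

    my @rows;
    foreach (1..200) {
        my $num = sprintf("%05d", int(rand(100000)));   # made-up zip value
        push @rows, {
            ID      => $_,
            name    => "person_$_",
            address => "$_ Any St.",
            city    => "Anytown",
            state   => "AnyPlace",
            zip     => $num,
        };
    }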

Each iteration for Insert, Select, Update, and Delete does the appropriate action on EACH of these 200 items, so 500 iterations actually means 100,000 Inserts, Selects, or Updates. Deletes are only done once per item because the data is destroyed in the process.

Insert: The main item of note for the insert tests was my use of $sth->bind_columns to collect the data. I used this method after reading this node by kudra. For CSV and flatfile, inserts basically just appended text to the end of the file.
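(The real insert code is in the linked program; as a rough sketch of the DBI side, a placeholder-based insert would look something like the following. The table name "people" is an assumption, not taken from the original.)

    my $sth = $dbh->prepare(
        "INSERT INTO people (ID, name, address, city, state, zip)
         VALUES (?, ?, ?, ?, ?, ?)"
    );
    foreach my $row (@rows) {
        $sth->execute(@{$row}{qw(ID name address city state zip)});
    }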

Insert Results:
Benchmark: timing 500 iterations of CSV_Insert, Flatfile_Insert, Mysql_Insert...
     CSV_Insert: 12 wallclock secs (11.30 usr +  0.16 sys = 11.46 CPU) @  43.63/s (n=500)
Flatfile_Insert:  1 wallclock secs ( 1.00 usr +  0.10 sys =  1.10 CPU) @ 454.55/s (n=500)
   Mysql_Insert: 59 wallclock secs (31.21 usr +  3.42 sys = 34.63 CPU) @  14.44/s (n=500)


Select: For select statements in flatfile and CSV I just looped through the lines of the file and compared the "ID" field with the desired "ID" value. All selects were done on the "ID" value as opposed to other values.
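(Roughly, the flatfile select amounts to a linear scan like this sketch; it follows the description above rather than the exact code in the linked program.)

    sub select_flatfile {
        my $want_id = shift;
        open(FF, "<$flatfile") || die "Unable to open '$flatfile': $!";
        while (<FF>) {
            chomp;
            my @fields = split /\|/;
            if ($fields[0] == $want_id) {   # compare on the ID column
                close(FF);
                return @fields;
            }
        }
        close(FF);
        return;   # no such ID
    }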

Select Results:
Benchmark: timing 500 iterations of CSV_Select, Flatfile_Select, Mysql_Select...
     CSV_Select: 7329 wallclock secs (7260.21 usr + 27.88 sys = 7288.09 CPU) @  0.07/s (n=500)
Flatfile_Select:   14 wallclock secs (  10.54 usr +  3.84 sys =   14.38 CPU) @ 34.77/s (n=500)
   Mysql_Select:  143 wallclock secs (  35.87 usr +  6.15 sys =   42.02 CPU) @ 11.90/s (n=500)


Update: Updates for flatfile and CSV were done by looping through the lines of the file and copying each line to a .new file, altering the "updated" line before the copy. After the whole file had been looped through, the .new file was renamed to replace the original. Again our "key" was the "ID" field.
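(In outline, that update approach looks like the following sketch; again it follows the description rather than the exact code in the linked program.)

    sub update_flatfile {
        my ($want_id, $new_line) = @_;
        open(OLD, "<$flatfile")     || die "Unable to open '$flatfile': $!";
        open(NEW, ">$flatfile.new") || die "Unable to open '$flatfile.new': $!";
        while (<OLD>) {
            my ($id) = split /\|/;
            $_ = "$new_line\n" if $id == $want_id;   # swap in the updated row
            print NEW $_;
        }
        close(OLD);
        close(NEW);
        rename("$flatfile.new", $flatfile) || die "rename failed: $!";
    }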

Update Results:
Benchmark: timing 500 iterations of CSV_Update, Flatfile_Update, Mysql_Update...
     CSV_Update: 17312 wallclock secs (16920.40 usr + 135.93 sys = 17056.33 CPU) @  0.03/s (n=500)
Flatfile_Update:   244 wallclock secs (  207.73 usr +  31.91 sys =   239.64 CPU) @  2.09/s (n=500)
   Mysql_Update:   263 wallclock secs (    8.27 usr +   2.63 sys =    10.90 CPU) @ 45.87/s (n=500)


Delete: Deleting data presented a difficult challenge: how do you delete data for more than one iteration when you've obviously destroyed your dataset after the first iteration? I was unable to find a way to do this that I didn't think would skew the results, so deleting was only done once; thus the results are likely inaccurate.

Delete Results:
Benchmark: timing 1 iterations of CSV_Delete, Flatfile_Delete, Mysql_Delete...
     CSV_Delete: 16 wallclock secs (14.91 usr +  0.30 sys = 15.21 CPU) @  0.07/s (n=1)
                 (warning: too few iterations for a reliable count)
Flatfile_Delete:  0 wallclock secs ( 0.18 usr +  0.04 sys =  0.22 CPU) @  4.55/s (n=1)
                 (warning: too few iterations for a reliable count)
   Mysql_Delete:  1 wallclock secs ( 0.01 usr +  0.02 sys =  0.03 CPU) @ 33.33/s (n=1)
                 (warning: too few iterations for a reliable count)


My Conclusions: In all tests it would appear flatfile was the winner with Mysql coming in a close second. CSV appears to be quite slow so I think I will only use it if I'm forced to deal with someone else's CSV data.

Thank you for your thoughts and suggestions,
James
Edit: Some suggested information from TheoPetersen which I very much agree with...
MySQL is not a flat file, it is a relational database. It is good at things that relational databases do well, such as retrieving related rows from multiple tables at once.

Flat files are not relational databases. They are good at things such as writing data sequentially to the end of a table or retrieving data with ad-hoc structure.

Don't use flat files when you need a relational database; you'll spend endless time implementing things that you can have for free by using the right data manager. Similarly, avoid making your relational database stand in for a simple flat file unless you will need relational capabilities on the data stored in the file later.

Replies are listed 'Best First'.
Re: Mysql, CSV, and flatfiles, a benchmark.
by TheoPetersen (Priest) on May 07, 2001 at 19:35 UTC
    At last someone has done a comprehensive analysis of apples and oranges :) While I won't disagree with your results, I suggest these paragraphs should be included:

    MySQL is not a flat file, it is a relational database. It is good at things that relational databases do well, such as retrieving related rows from multiple tables at once.

    Flat files are not relational databases. They are good at things such as writing data sequentially to the end of a table or retrieving data with ad-hoc structure.

    Don't use flat files when you need a relational database; you'll spend endless time implementing things that you can have for free by using the right data manager. Similarly, avoid making your relational database stand in for a simple flat file unless you will need relational capabilities on the data stored in the file later.

      Thank you for saying what I wanted to say but couldn't form into a coherent bit of text.

      A lot of users do NOT realize the truth of what you say, myself included when I first discovered Mysql and other relational databases. Without your addendum I fear my post might have confused such people more than it helped.

      Thanks, James

      MySQL doesn't do subqueries, nor does it implement commit/rollback (and a host of other features, but I sure would miss these two were I required to use it).

      It's great if you consider it as flatfiles-on-steroids, which is what the gist of this benchmark seems to be about. On the other hand, if you're looking at MySQL as an Oracle-on-the-cheap, then there's a good article on why that is a Bad Idea.

      PostgreSQL forever!


      --
      g r i n d e r
Re: Mysql, CSV, and flatfiles, a benchmark.... (The Code)
by illitrit (Friar) on May 07, 2001 at 19:12 UTC
    Here is the code for any that would prefer to view it on-site.

    James

      Definitely not a surprise - PSV (pipe-separated) files are definitely easier to generate/parse than CSV files (and you don't seem to be dealing with the problem of escaping pipes within your data). A more interesting comparison would be to attempt a 'join' between two extremely large PSV files, and compare that to MySql. I'll bet MySql will give you a better run for the money, and memory usage would be better.

      We benchmarked Text::CSV a long time ago at my last job, as we were getting a lot of CSV files from the outside world. Its performance (topped out at 12-14K rows/minute on a p2-400 with 512M) caused us to re-architect our software to minimize the CSV data we were working on. It was quite a surprise at the time, but in retrospect, parsing CSV data can be quite problematic (especially when the damn users don't follow the rules!)

        Definitely not a surprise - PSV (pipe-separated) files are definitely easier to generate/parse than CSV files
        Until you have to deal with a pipe in your data. Then, there's no difference.

        -- Randal L. Schwartz, Perl hacker

        Dropping Text::CSV was a good idea. If your data had any returns in it, Text::CSV would not have handled that.

        I believe it is slow, but I wouldn't have minded seeing Text::xSV in the above comparison because it does handle that...

Re: Mysql, CSV, and flatfiles, a benchmark.
by koolade (Pilgrim) on May 08, 2001 at 06:45 UTC

    I agree with TheoPetersen's points, and I will add a warning that there are more important things than speed:

    • MySQL (or other relational databases) usually holds up better when you start increasing the number of records in your tables. Flat files with 1000s of records start to become very, very slow. MySQL can hold up quite well even when you get into 5+ digits. (There have been projects where I've used text files thinking there'd only be a couple hundred records. Several years later, let's just say the client exceeded my expectations.)

    • MySQL is a lot easier to work with when your tables need to mutate. If your client decides that they want to add/drop fields, then it's easier to do using a few ALTER statements, rather than hacking scripts together to manually rearrange your database tables. It's better during the development process as well.
    • With flat files you have to be a lot more careful about file locking and data integrity. That's not as much of a concern with most RDBMs.

    • With flat files you have to be a lot more careful about security. With MySQL, etc., you have a lot of flexibility, and can assign access privileges to each action (SELECT, INSERT, DELETE, etc.).

    ...and so on. I know that you weren't using your benchmarks to start a war about MySQL vs flat files, but you did mention that you were looking for the "best way". In my humble opinion, the fastest way isn't the best way in this case. For a lot of projects, messing with flat files or CSVs is just too much of a headache.

    It is good to see the benchmarks though. Thanks for taking the initiative!

Re: Mysql, CSV, and flatfiles, a benchmark.
by petethered (Pilgrim) on May 09, 2001 at 11:28 UTC
    Well, there are some problems with your benchmark. I'm going to ignore CSV since it's obviously the slowest.

    First, your method of inserting into the mysql DB is the SLOWEST of the three options:

    Benchmark: timing 500 iterations of dbh_insert, dbh_insert_values, sth_insert...
           dbh_insert: 28 wallclock secs ( 5.99 usr + 1.68 sys =  7.67 CPU) @ 65.19/s (n=500)
    dbh_insert_values: 27 wallclock secs ( 6.02 usr + 1.82 sys =  7.84 CPU) @ 63.78/s (n=500)
           sth_insert: 59 wallclock secs (28.48 usr + 2.41 sys = 30.89 CPU) @ 16.19/s (n=500)

    I used your database format with one small change: adding an index on ID, which slows down the inserts.

    For those curious, here are the BM stats of the three insert formats without an index on ID.

    Benchmark: timing 500 iterations of dbh_insert, dbh_insert_values, sth_insert...
           dbh_insert: 20 wallclock secs ( 5.30 usr + 1.41 sys =  6.71 CPU) @ 74.52/s (n=500)
    dbh_insert_values: 23 wallclock secs ( 5.62 usr + 1.45 sys =  7.07 CPU) @ 70.72/s (n=500)
           sth_insert: 55 wallclock secs (27.53 usr + 2.50 sys = 30.03 CPU) @ 16.65/s (n=500)

    Second problem is the actual import subroutines.

    Your import routine is skewed in favor of flatfile since it is simulating a mass import, not periodic imports.

    Your code:

    sub insert_flatfile {
        # Note: files opened in > mode again to prevent disk space issues
        open(FF, ">$flatfile") || die "Unable to open '$flatfile': $!";
        foreach (@rows) {
            print FF join('|', ($_->{ID}, $_->{name}, $_->{address},
                                $_->{city}, $_->{state}, $_->{zip})), "\n";
        }
        close(FF);
    }

    Now, to simulate a production environment it should be more like this:

    use Fcntl qw(:flock);   # needed for the LOCK_EX / LOCK_UN constants

    sub insert_flatfile {
        # Note: files opened in > mode again to prevent disk space issues
        open(FF, ">$flatfile") || die "Unable to open '$flatfile': $!";
        close(FF);
        foreach (@rows) {
            open(FF, ">>$flatfile") || die "Unable to open '$flatfile': $!";
            flock(FF, LOCK_EX);
            print FF join('|', ($_->{ID}, $_->{name}, $_->{address},
                                $_->{city}, $_->{state}, $_->{zip})), "\n";
            flock(FF, LOCK_UN);
            close(FF);
        }
    }

    Note: There is still a problem with this routine: in a production environment you would have to read the flatfile (or another file where you store the next id) to find out the next ID to store, while MySQL could use an auto_increment on the ID column. But for this benchmark's needs that's really overkill.
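    (For reference, a table definition along those lines might look like the sketch below; the table name and column sizes are guesses, not taken from the original script.)

        # AUTO_INCREMENT lets MySQL pick the next ID; a flatfile has to track that itself.
        $dbh->do(q{
            CREATE TABLE people (
                ID      INT NOT NULL AUTO_INCREMENT,
                name    VARCHAR(50),
                address VARCHAR(100),
                city    VARCHAR(50),
                state   VARCHAR(50),
                zip     VARCHAR(10),
                PRIMARY KEY (ID)
            )
        });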

    Note2: If simulating the mass importation of data, MySQL can be greatly sped up using

    insert into table (blah, blah2, blah3)
    values ('vbla1',  'vbla2',  'vbla3'),
           ('vbla1b', 'vbla2b', 'vbla3b'),
           etc...
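    (For illustration, a sketch of assembling such a multi-row INSERT with DBI; the "people" table and its columns are placeholders, not from the original script, and $dbh->quote() does the string escaping.)

        # Build "(v1,v2,...),(v1,v2,...)" from the row hashes, then send one INSERT.
        my $values = join ",\n",
            map {
                "(" . join(",", map { $dbh->quote($_) } @{$_}{qw(ID name address city state zip)}) . ")"
            } @rows;
        $dbh->do("INSERT INTO people (ID, name, address, city, state, zip) VALUES $values");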

    Note3: Depending on how the data is used (whether it needs to be instantly retrievable or not), insert delayed can be used, which makes the inserting APPEAR to be near instantaneous.

    Here are my results for the benchmarks using my 2 changes WITHOUT an index on ID

    Benchmark: timing 500 iterations of Flatfile_Insert, Mysql_Insert...
    Flatfile_Insert:   8 wallclock secs (  4.29 usr +  3.30 sys =   7.59 CPU) @ 65.88/s (n=500)
       Mysql_Insert:  24 wallclock secs (  5.32 usr +  1.30 sys =   6.62 CPU) @ 75.53/s (n=500)
    ========================================
    Benchmark: timing 500 iterations of Flatfile_Select, Mysql_Select...
    Flatfile_Select:  15 wallclock secs (  7.99 usr +  5.81 sys =  13.80 CPU) @ 36.23/s (n=500)
       Mysql_Select: 127 wallclock secs ( 35.89 usr +  4.20 sys =  40.09 CPU) @ 12.47/s (n=500)
    ========================================
    Benchmark: timing 500 iterations of Flatfile_Update, Mysql_Update...
    Flatfile_Update: 175 wallclock secs (146.59 usr + 24.77 sys = 171.36 CPU) @  2.92/s (n=500)
       Mysql_Update: 189 wallclock secs ( 10.51 usr +  1.92 sys =  12.43 CPU) @ 40.23/s (n=500)
    ========================================
    We can only do 1 iteration of the delete because it necessarily destroys our datasource.
    Benchmark: timing 1 iterations of Flatfile_Delete, Mysql_Delete...
    Flatfile_Delete:   0 wallclock secs (  0.17 usr +  0.01 sys =   0.18 CPU) @  5.56/s (n=1)
                       (warning: too few iterations for a reliable count)
       Mysql_Delete:   0 wallclock secs (  0.00 usr +  0.00 sys =   0.00 CPU)
                       (warning: too few iterations for a reliable count)

    And now with an index on ID:

    Benchmark: timing 500 iterations of Flatfile_Insert, Mysql_Insert...
    Flatfile_Insert:   8 wallclock secs (  4.02 usr +  3.36 sys =   7.38 CPU) @ 67.75/s (n=500)
       Mysql_Insert:  26 wallclock secs (  5.74 usr +  1.93 sys =   7.67 CPU) @ 65.19/s (n=500)
    ========================================
    Benchmark: timing 500 iterations of Flatfile_Select, Mysql_Select...
    Flatfile_Select:  14 wallclock secs (  8.25 usr +  5.39 sys =  13.64 CPU) @ 36.66/s (n=500)
       Mysql_Select:  85 wallclock secs ( 34.61 usr +  5.33 sys =  39.94 CPU) @ 12.52/s (n=500)
    ========================================
    Benchmark: timing 500 iterations of Flatfile_Update, Mysql_Update...
    Flatfile_Update: 179 wallclock secs (147.56 usr + 25.98 sys = 173.54 CPU) @  2.88/s (n=500)
       Mysql_Update:  37 wallclock secs ( 10.73 usr +  2.33 sys =  13.06 CPU) @ 38.28/s (n=500)
    ========================================
    We can only do 1 iteration of the delete because it necessarily destroys our datasource.
    Benchmark: timing 1 iterations of Flatfile_Delete, Mysql_Delete...
    Flatfile_Delete:   0 wallclock secs (  0.13 usr +  0.04 sys =   0.17 CPU) @  5.88/s (n=1)
                       (warning: too few iterations for a reliable count)
       Mysql_Delete:   0 wallclock secs (  0.01 usr +  0.01 sys =   0.02 CPU) @ 50.00/s (n=1)
                       (warning: too few iterations for a reliable count)


    EDIT: I was looking at it some more and I decided I didn't think the select routine was the best way of testing either.

    Basically you're selecting each row once, just in a random order. A better way to simulate a production environment would be to crank up the iterations ($number_of_rows * 500, for example) and have the subroutine pick a random row to select each time: $a = int(rand($number_of_rows))
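    (Something along these lines, as a rough sketch; the table name "people" is a placeholder, and $dbh and $number_of_rows are assumed to be set up elsewhere.)

        sub select_random_row {
            my $id  = 1 + int(rand($number_of_rows));   # row IDs run 1..$number_of_rows
            my $sth = $dbh->prepare("SELECT * FROM people WHERE ID = ?");
            $sth->execute($id);
            return $sth->fetchrow_hashref;
        }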


    Pete
    insert into pete values('red hair','near green eyes','overinflated ego');

Re: Mysql, CSV, and flatfiles, a benchmark.
by r.joseph (Hermit) on May 08, 2001 at 03:49 UTC
    Great post illitrit! I know you and I have chatted about this particular benchmark ad infinitum, but your post really put it all together - as I have said before, this benchmark is definitely helpful (and makes me feel better :) - good job!

    r. j o s e p h
    "Violence is a last resort of the incompetent" - Salvor Hardin, Foundation by Issac Asimov
