http://qs321.pair.com?node_id=362651

js1 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a script to parse a proxy log (node 362548) and I'm planning to use DBI to update a mysql database when the sql insert command reaches 2MB (node 362109).

In order to run this script as quickly as possible, would I be right to run the database update whilst I carry on parsing the log? I presume I need some sort of thread to do this (not something I've done before in Perl).

Could someone tell me if I'm on the right track here and if not point me in the right direction? If threads are the way to go where would be the best place to start?

Also, this script will run on a machine with 2 CPUs. Will threading enable it to use both CPUs?

Thanks a lot,

js.

Edit by BazB: changed href links into Perlmonks node links.

Re: faster with threads?
by Zaxo (Archbishop) on Jun 09, 2004 at 09:06 UTC

    Both reading the log and updating the database are apt to be I/O bound, so threads or a forked process make sense (particularly if the log and db are on different spindles). I'm not too familiar with perl threads, so let's fork the db updater with a pipe from the parent to ship data over:

    use strict;
    use warnings;
    use DBI;

    sub update_db {
        # ...
    }
    sub mung_logline {
        # ...
    }

    pipe my($rd, $wr) or die $!;
    my $cpid;
    {
        $cpid = fork;
        die $! if not defined $cpid;
        unless ($cpid) {
            # Child: read munged lines from the pipe and update the DB.
            close $wr;
            my $dbh = DBI->connect( 'dbi:mysql:yourdb', 'user', 'password' )
                or die $DBI::errstr;    # placeholder DSN and credentials
            while (<$rd>) {
                update_db( $dbh, $_ );
            }
            close $rd;
            exit 0;
        }
    }
    # Parent: parse the log and ship each munged line down the pipe.
    close $rd;
    {
        open my $fh, '<', '/path/to/log.file' or die $!;
        while (<$fh>) {
            $_ = mung_logline($_);
            print $wr $_;
        }
        close $fh;
    }
    close $wr;
    That leaves out your desire to accumulate a large instruction, possible autoflushing of the pipe, @SIG{'CHLD','PIPE'} handling and/or wait, and much error handling. Take it as a skeleton.

    It may be best to produce a $sth = $dbh->prepare('whatever ?') with placeholders right after the $dbh is obtained. Then you can pass the $sth already set up to update_db, instead of $dbh.
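
    For instance, a minimal sketch of the child's loop with a prepared handle; the table and column names here are assumptions, and the munged fields are taken to be tab-separated:

    my $sth = $dbh->prepare('INSERT INTO access_log (host, url) VALUES (?, ?)')
        or die $dbh->errstr;
    while (<$rd>) {
        chomp;
        $sth->execute( split /\t/ );    # one placeholder per munged field
    }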

    After Compline,
    Zaxo

      One thing to watch out for: is DBI thread/fork safe?

      I seem to recall having problems with database connections 'going away' when using a database handle inside forked-off child processes. I resolved this by having each child process create its own database handle.

      Another thing to watch: if the DB and the log reside on the same disk, you could see decreased performance from doing major I/O on both at once. Hence, maybe try to read the whole log into memory first.

        The issue with DBI and forking is this: You need to set $dbh->{InactiveDestroy} = 1 in all of your child processes. Otherwise, when one child process dies it may kill database connections in other processes. (This is because as part of the DESTROY for the handle, it tells the database server that it is done with the connection... even if there's another process that is not done with it.)
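
        A minimal sketch of that, assuming a $dbh already connected in the parent ($dsn, $user and $pass are placeholders):

        defined( my $pid = fork ) or die "fork: $!";
        unless ($pid) {                      # child
            $dbh->{InactiveDestroy} = 1;     # inherited handle: don't tear down
                                             # the server connection on DESTROY
            my $child_dbh = DBI->connect( $dsn, $user, $pass )
                or die $DBI::errstr;         # the child gets its own connection
            # ... use $child_dbh here, never the inherited $dbh ...
            exit 0;
        }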

        Also be warned of the greater conceptual issue that you can only use a DB connection in one process. The connection may exist in other processes (hence the warning above about setting InactiveDestroy), but it can only be used in one.

        Also, no, DBI is not threadsafe (unless this has changed recently... and that would be a big deal).

        ------------ :Wq Not an editor command: Wq
        Necroshine

        Well, I've tried to solve this with this concept, but it doesn't work for me. After some time, the connections were broken and the process stayed inactive. Why?


        Cheers!
Re: faster with threads?
by BrowserUk (Patriarch) on Jun 09, 2004 at 09:30 UTC

    My first reaction is that I'd use 3 threads.

    1. Reads the file and feeds Q1 (a Thread::Queue).
    2. Reads Q1, processes the records and feeds Q2 with the data for updating the DB.
    3. Reads Q2, and fires off the updates.

    I'd use a high/low water-marking system to allow threads to yield their timeslices once the queue they were feeding moved above the high-water mark, until it dropped back to the low-water mark. Reading threads would yield when their input queues dropped below the low-water mark.

    Then run some benchmarks to establish the best values for the watermarks.
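
    A minimal sketch of that three-thread pipeline; the water marks here are arbitrary starting values, and mung_logline() and update_db() are hypothetical stand-ins for the record processing and DB work:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    use constant { HIGH => 1_000, LOW => 100 };    # arbitrary; benchmark to tune

    my $Q1 = Thread::Queue->new;
    my $Q2 = Thread::Queue->new;

    # 1. Read the file and feed Q1, backing off between the water marks.
    my $reader = threads->create( sub {
        open my $fh, '<', '/path/to/log.file' or die $!;
        while ( my $line = <$fh> ) {
            if ( $Q1->pending >= HIGH ) {
                threads->yield while $Q1->pending > LOW;   # drain to the low mark
            }
            $Q1->enqueue($line);
        }
        $Q1->enqueue(undef);                               # end-of-stream marker
    } );

    # 2. Read Q1, process each record, and feed Q2.
    my $worker = threads->create( sub {
        while ( defined( my $line = $Q1->dequeue ) ) {
            $Q2->enqueue( mung_logline($line) );           # hypothetical processor
        }
        $Q2->enqueue(undef);
    } );

    # 3. Read Q2 and fire off the DB updates.
    my $writer = threads->create( sub {
        while ( defined( my $rec = $Q2->dequeue ) ) {
            update_db($rec);                               # hypothetical DB updater
        }
    } );

    $_->join for $reader, $worker, $writer;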


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
      Why do all that stuff with high and low marks, and yielding threads, if the OS can do it for you (and probably a lot more efficiently)? Just use three processes and pipes. Writing to a pipe whose buffer is full will block - causing the process to give up its timeslice - and so will reading from a pipe whose buffer is empty.
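
      For illustration, a minimal sketch of the middle process; mung_logline() is Zaxo's hypothetical per-line munger, and loader.pl is an assumed downstream script that does the DB inserts:

      # Upstream and downstream are separate processes; the pipes' buffers
      # provide the flow control for free.
      open my $in,  '-|', 'cat', '/path/to/log.file' or die $!;
      open my $out, '|-', 'perl', 'loader.pl'        or die $!;
      while (<$in>) {
          print {$out} mung_logline($_);   # blocks if the downstream pipe is full
      }
      close $out or die $!;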

      Abigail

        The idea is to ensure that the CPU-bound part of the process never has to wait for data, and so uses as many timeslices as are available to it.

        The water marks allow you to easily tailor the threading to maximise throughput.

        Using the queues makes it easy to have more than one thread processing the slow part(s) of the processing. Each thread is identical; you just start more of them, and they all read their input from the same queue. You don't get this easy flexibility using pipes.

        If the processing of the data is the bottleneck you start two threads for that. If outputting to the DB is the bottleneck, have two threads doing that.

        If the DB is running on the same box (with 2 CPUs) then it will likely dominate one of them and all the threads will basically share the other. If the DB is on a different box, then the CPU-bound thread may dominate one processor and the IO/DB threads share the other.

        The yielding should rarely come into play once you get the right watermark levels established, but it acts as a safeguard for the situations where either the IO or DB slows up--someone does a grep on the disk or hits the DB with a heavy query. It prevents the Q from filling memory whilst the processing at the other end is blocked.

        The reasons I would try threads are:

        1. I'm more familiar with the threading model (forking is only threading under the covers, and without the control, where I live).
        2. I think that IPC through shared memory is more convenient and easier to program than through the flat stream of a pipe.
        3. You can share structured data using threads. I'm not yet certain if it is up to large scale production use, but it is much improved in 5.8.3.

        This final point is quite important with the OP's application. Basically he is reading lines, splitting them into chunks, and then throwing them into a DB. The DB IO is quite likely to be the slowest part of the overall processing.

        If, having split the lines into chunks, he then has to serialise those chunks to pass them through a pipe to the DB process, he hasn't gained anything by splitting out the DB process.

        He would then have to deserialise them on the other side, and the serialisation/deserialisation is likely to take much the same amount of time as the splitting, which negates the reason for having a separate process for the DB IO.

        I can't honestly say whether my thoughts would result in faster overall processing. There are too many factors involved. I don't have a dual processor machine to test on. There are many details that the OP hasn't supplied: where is the DB? How much indexing is on the DB? Is the DB shared with other applications? etc.

        Until someone actually tries some of this stuff using threads, nobody knows how it will stand up. Until recently, memory leaks prevented any worthwhile testing. With 5.8.3, that seems to be getting much better to the point where it is now worth trying stuff out again.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
Re: faster with threads?
by Abigail-II (Bishop) on Jun 09, 2004 at 10:55 UTC
    Also, this script will run on a machine with 2 CPUs. Will threading enable it to use both CPUs?
    Perhaps; this will depend on various things. It'll depend on the threading implementation on the OS that's being used, but it will also depend on the scheduling policy of your OS. If, for instance, your database is running on the same machine, its process might use enough CPU time to occupy a CPU on its own.

    Threads may be an answer if you can split your CPU-bound tasks into two - you only have one CPU-bound task (the actual parsing), but that can easily be split into two - have each thread deal with half of the file. However, instead of using two threads, it's much easier to use two processes (depending on your OS, you might even get more CPU time slots that way), and you save the not insignificant overhead Perl threads will give you. Now you might want to use separate threads to do your I/O as well. But that is something you should only consider if it's an actual bottleneck - and if it is, you can use separate processes there as well, and have them communicate with pipes.
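
    A minimal sketch of that split; '/path/to/log.file' and parse_line() are placeholders, and each child takes one half of the file by byte offset, with the line straddling the midpoint going to the first child:

    use strict;
    use warnings;

    my $file = '/path/to/log.file';
    my $mid  = int( ( -s $file ) / 2 );

    for my $half ( 0, 1 ) {
        defined( my $pid = fork ) or die "fork: $!";
        next if $pid;                    # parent: launch the next child

        open my $fh, '<', $file or die $!;
        if ($half) {
            seek $fh, $mid - 1, 0;       # back up one byte, then ...
            <$fh>;                       # ... skip the line owned by the first child
        }
        while ( my $line = <$fh> ) {
            # parse_line($line);         # hypothetical per-line parser
            last if !$half and tell($fh) >= $mid;   # first child stops past the midpoint
        }
        exit 0;
    }
    wait; wait;                          # reap both children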

    Abigail

Re: faster with threads?
by paulbort (Hermit) on Jun 09, 2004 at 13:18 UTC
    If you want this to go faster, I would suggest not worrying about threading and instead worrying about the inserts. Inserts are slow operations on almost any (every?) database.

    I don't know MySQL at all, but every other database I've run into has a "bulk insert" mode or command. If you want this to go fast, do whatever you have to in Perl to get the data into a format that can be bulk inserted. (In PostgreSQL, this would be a simple line transform, since the bulk insert format is plain text and "COPY FROM STDIN" magic.) If MySQL doesn't have a bulk insert method, you might want to reconsider your database choice.

    The other thing that will help is to do the bulk insert to a table that has no indexes, then move the data to an indexed table (using SQL) when the machine has spare time. This could be done with Perl and DBI, or maybe an ON DELETE trigger that then inserts the record in the indexed table.
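
    A minimal sketch of the move, assuming hypothetical staging and access_log tables and an open $dbh:

    # Move the bulk-loaded rows into the indexed table during idle time.
    $dbh->do('INSERT INTO access_log SELECT * FROM staging');
    $dbh->do('DELETE FROM staging');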

    Once you're doing bulk inserts, if you are still having performance concerns, then some sort of buffering is appropriate. (both the pipe and queue suggestions above are valid, pick the one you feel most comfortable with.)


    --
    Spring: Forces, Coiled Again!
      There is a bulk insert for mysql (LOAD DATA INFILE) and I completely agree with you on the speed issue. Threading seems to buy nothing but pain in this case.
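
      A minimal sketch of that route with DBI; the DSN, credentials and access_log table layout are assumptions:

      use strict;
      use warnings;
      use DBI;
      use File::Temp 'tempfile';

      # Enable LOCAL INFILE so the client can feed the file to the server.
      my $dbh = DBI->connect( 'dbi:mysql:proxylog;mysql_local_infile=1',
          'user', 'password', { RaiseError => 1 } );

      # Write the parsed records as tab-separated lines to a temporary file.
      my @parsed_records = ( [ '10.0.0.1', '/index.html' ] );    # stand-in data
      my ( $tmp, $tmpname ) = tempfile( UNLINK => 1 );
      print {$tmp} join( "\t", @$_ ), "\n" for @parsed_records;
      close $tmp or die $!;

      # One bulk load instead of millions of INSERTs.
      $dbh->do( 'LOAD DATA LOCAL INFILE ' . $dbh->quote($tmpname)
              . ' INTO TABLE access_log' );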
Re: faster with threads?
by eserte (Deacon) on Jun 09, 2004 at 09:39 UTC
    Also, this script will run on a machine with 2 CPUs. Will threading enable it to use both CPUs?
    This depends on the threading implementation of your operating system. I believe that on FreeBSD 4.x it won't (unless you use the linuxthreads package), while on Linux it will.
      Check your kernel: if you are using the SMP Linux kernel, there's a good chance it will use both processors. Also check whether your DB is set up to use threads.
Re: faster with threads?
by dragonchild (Archbishop) on Jun 09, 2004 at 14:02 UTC
    What I found is that RedHat 9 will allocate one CPU to do computation and the other as a glorified I/O manager. This is without threading. Personally, I would see if doing it without threading is fast enough. If it is, who cares whether you can make it faster?

    Remember - a requirement that most people forget to clarify is "How fast is fast enough?". The pursuit of perfection is harmful to good deliverables.

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: faster with threads?
by CountZero (Bishop) on Jun 09, 2004 at 13:15 UTC
    If you really have such a large amount of data to insert, I would go the easy way and use MySQL's LOAD DATA statement.

    Then all your script has to do is put together the load file and hand it over to the MySQL server, which will do all the heavy lifting (and faster than you can with multiple INSERT statements).

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: faster with threads?
by Ryszard (Priest) on Jun 09, 2004 at 12:27 UTC
    Updates are slow; in my experience (at least with Oracle), putting data into a database using INSERTs is way slow.

    If i were to build your system there are a couple of things i would do:

    1. Research what you think are the best options.
    2. Develop prototypes for all appropriate options.
    3. Run the prototypes against a limited data set.
    4. Extrapolate the timings.
    5. Develop the winner further.

      Use the mysql bulk insert statement format if possible.

      We found that using the traditional prepare/execute cycle was about 10 times slower at inserting 2 million records than using several bulk SQL statements of about 500K each.

      my @rows = (
          [qw(a b c d)], [qw(e f g h)], [qw(i j k l)], [qw(m n o p)],
          [qw(q r s t)], [qw(u v w x)], [qw(y z a b)], [qw(c d e f)],
          [qw(h i j k)], [qw(l m n o)],
      );

      # -------------------------------------
      # slower
      # -------------------------------------
      my $sth = $db->prepare("INSERT INTO mytable VALUES(?,?,?,?)")
          or die "prepare failed: " . $db->errstr;
      for my $row ( @rows ) {
          $sth->execute(@$row);
      }

      # -------------------------------------
      # faster
      # -------------------------------------
      my $sql = "INSERT INTO mytable VALUES ";
      for my $row ( @rows ) {
          $sql .= "('" . join( "','", @$row ) . "'),";
      }
      substr( $sql, -1 ) = '';    # drop the trailing comma
      $db->do( $sql );
        And if using a bulk insert isn't possible for some reason, first drop the indices on the table(s) involved, insert your data, then rebuild the indices. Unless the tables already contain billions of rows, rebuilding the indices is faster than modifying them millions of times.
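
        In MySQL (MyISAM) much the same effect is available without dropping the indices by hand; a minimal sketch, assuming an open $dbh and a hypothetical access_log table:

        $dbh->do('ALTER TABLE access_log DISABLE KEYS');   # stop maintaining non-unique indices
        # ... run the bulk inserts here ...
        $dbh->do('ALTER TABLE access_log ENABLE KEYS');    # rebuild the indices in one pass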

        Abigail

Re: faster with threads?
by cyocum (Curate) on Jun 09, 2004 at 11:03 UTC

    Everyone's answers to your questions are great as usual. The only thing I can add is that the documentation for Perl's threading is in perlthrtut for later versions of perl (5.6.0 and above) and, for earlier versions (5.005 up to but not including 5.6.0), in perlothrtut, which I cannot seem to find on perldoc.com right now. I hope this helps!

Re: faster with threads?
by bsb (Priest) on Jun 10, 2004 at 03:51 UTC
    I don't see why the added complexity of threads is worth it for this case.
    Why does the speed matter?

    You're better off keeping your code simple and leveraging OS scheduling, pipes, and the database's bulk-load utilities.