Re: faster with threads?
by Zaxo (Archbishop) on Jun 09, 2004 at 09:06 UTC
|
Both reading the log and updating the database are apt to be I/O bound, so threads or a forked process make sense (particularly if the log and db are on different spindles). I'm not too familiar with perl threads, so lets fork the db updater with a pipe from the parent to ship data over,
sub update_db {
    # ...
}
sub mung_logline {
    # ...
}

pipe my ($rd, $wr) or die $!;

my $cpid = fork;
die $! if not defined $cpid;
unless ($cpid) {
    # Child: read munged lines from the pipe, feed the database.
    close $wr;
    my $dbh = DBI->connect($dsn, $user, $pass);   # connection details elided
    while (<$rd>) {
        update_db( $dbh, $_ );
    }
    close $rd;
    exit 0;
}

# Parent: read the log, mung each line, ship it down the pipe.
close $rd;
open my $fh, '<', '/path/to/log.file' or die $!;
while (<$fh>) {
    $_ = mung_logline($_);
    print $wr $_;
}
close $fh;
close $wr;
That leaves out your desire to accumulate a large instruction, possible autoflushing of the pipe, @SIG{'CHLD','PIPE'} handling and/or wait, and much error handling. Take it as a skeleton.
It may be best to produce a $sth = $dbh->prepare('whatever ?') with placeholders right after the $dbh is obtained. Then you can pass the $sth already set up to update_db, instead of $dbh.
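A minimal sketch of that prepare-once idea; the table name, columns, and connection details here are invented stand-ins, not anything from the original post:

```perl
use DBI;

my ($user, $pass) = ('me', 'secret');   # stand-in credentials
my $dbh = DBI->connect('dbi:mysql:logdb', $user, $pass, { RaiseError => 1 });

# Prepare once, right after connecting; execute many times.
my $sth = $dbh->prepare('INSERT INTO log (host, stamp, msg) VALUES (?,?,?)');

sub update_db {
    my ($sth, $line) = @_;
    # Placeholders quote and escape the values for us.
    $sth->execute( split /\t/, $line, 3 );
}
```

Passing `$sth` instead of `$dbh` also means the child never re-parses the SQL on each line.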
| [reply] [d/l] |
|
One thing to watch out for: is DBI thread/fork safe?
I seem to recall having problems with database connections 'going away' when using a database handle inside forked-off child processes. I resolved this by having each child process create its own database handle.
Another thing to watch: if the DB and log reside on the same disk, you could see decreased performance from doing heavy I/O on both at once. Hence maybe try to read the whole log into memory first.
| [reply] |
|
The issue with DBI and forking is this: You need to set $dbh->{InactiveDestroy} = 1 in all of your child processes. Otherwise, when one child process dies it may kill database connections in other processes. (This is because as part of the DESTROY for the handle, it tells the database server that it is done with the connection... even if there's another process that is not done with it)
Also be warned of the greater conceptual issue that you can only use a DB connection in one process. The connection may exist in other processes (hence the warning above about setting InactiveDestroy), but it can only be used in one.
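A sketch of that InactiveDestroy dance, assuming `$dbh`, `$dsn`, `$user` and `$pass` already exist in the parent:

```perl
my $pid = fork;
die "fork failed: $!" unless defined $pid;
unless ($pid) {
    # Child: stop this copy of the handle from closing the
    # parent's server-side connection when it is destroyed...
    $dbh->{InactiveDestroy} = 1;
    undef $dbh;
    # ...then open a connection of our own to actually use.
    my $child_dbh = DBI->connect($dsn, $user, $pass, { RaiseError => 1 });
    # ... child work goes through $child_dbh only ...
    exit 0;
}
```

The parent keeps using its original `$dbh`; the child never touches it.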
Also, no, DBI is not threadsafe (unless this has changed recently... and that would be a big deal).
------------
:Wq
Not an editor command: Wq
| [reply] [d/l] [select] |
|
Well, I've tried to solve this with this concept, and it doesn't work for me. After some time, the connection was broken and the process stayed inactive. Why?
Cheers!
| [reply] |
Re: faster with threads?
by BrowserUk (Patriarch) on Jun 09, 2004 at 09:30 UTC
|
My first reaction is that I'd use 3 threads:
- One reads the file and feeds Q1 (Thread::Queue).
- One reads Q1, processes the records, and feeds Q2 with the data for updating the DB.
- One reads Q2 and fires off the updates.
I'd use a high/low water marking system to allow threads to yield their timeslices once the queue they were feeding moved above the high water mark, until it dropped back to the low watermark. Reading threads would yield when their input queues dropped below low water.
Then run some benchmarks to establish the best values for the watermarks.
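A rough sketch of that pipeline, assuming the `mung_logline` and `update_db` subs from earlier in the thread; the high-water number is an arbitrary placeholder you would tune by benchmarking:

```perl
use threads;
use Thread::Queue;

my $q1 = Thread::Queue->new;   # raw lines
my $q2 = Thread::Queue->new;   # munged records

my $reader = threads->create(sub {
    open my $fh, '<', '/path/to/log.file' or die $!;
    $q1->enqueue($_) while <$fh>;
    $q1->enqueue(undef);                 # end-of-stream marker
});

my $worker = threads->create(sub {
    while (defined(my $line = $q1->dequeue)) {
        # Crude high-water mark: yield while the output queue is swollen.
        threads->yield while $q2->pending > 10_000;
        $q2->enqueue( mung_logline($line) );
    }
    $q2->enqueue(undef);
});

my $writer = threads->create(sub {
    while (defined(my $rec = $q2->dequeue)) {
        update_db($rec);   # all DBI work stays in this one thread
    }
});

$_->join for $reader, $worker, $writer;
```

To add a second processing thread, start another copy of the `$worker` sub reading the same `$q1`.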
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] |
|
Why do all that stuff with high and low marks, and yielding threads, if the OS can do it for you (and probably a lot more efficiently)? Just use three processes and pipes. Writing to a pipe whose buffer is full will block - causing the process to give up its timeslice - and so will reading from a pipe whose buffer is empty.
Abigail
| [reply] |
|
The idea is to ensure that the cpu bound part of the process never has to wait for data and so uses as many timeslices available to it as possible.
The water marks allow you to easily tailor the threading to maximise throughput.
Using the queues makes it easy to have more than one thread processing the slow part(s) of the processing. Each thread is identical, you just start more of them. They all read their input from the same queue. You don't get this easy flexibility using pipes.
If the processing of the data is the bottleneck you start two threads for that. If outputting to the DB is the bottleneck, have two threads doing that.
If the DB is running in the same box (with 2 cpu's) then it will likely dominate one of them and all the threads will basically share the other. If the DB is on a different box, then the cpu-bound thread may dominate one process and the IO/DB threads share the other.
The yielding should rarely come into play once you get the right watermark levels established, but it acts as a safeguard for the situations where either the IO or DB slows up--someone does a grep on the disk or hits the DB with a heavy query. It prevents the Q from filling memory whilst the processing at the other end is blocked.
The reason I would try threads are:
- I'm more familiar with the threading model (forking is only threading under the covers, and without the control, where I live).
- I think that IPC through shared memory is more convenient and easier to program than through the flat stream of a pipe.
- You can share structured data using threads. I'm not yet certain if it is up to large scale production use, but it is much improved in 5.8.3.
This final point is quite important with the OP's application. Basically he is reading lines, splitting them into chunks, and then throwing them into a DB. The DB IO is quite likely to be the slowest part of the overall processing.
If having split the lines into chunks, he then has to serialise those chunks to pass them through a pipe to the DB process, he hasn't gained anything by splitting out the DB process.
He would then have to deserialise it and the serialisation/deserialisation is likely to take much the same amount of time as the splitting, which negates the reason for having a separate process for the DB IO.
I can't honestly say whether my thoughts would result in faster overall processing. There are too many factors involved. I don't have a dual processor machine to test on. There are many details that the OP hasn't supplied: where is the DB? How much indexing is on the DB? Is the DB shared with other applications? etc.
Until someone actually tries some of this stuff using threads, nobody knows how it will stand up. Until recently, memory leaks prevented any worthwhile testing. With 5.8.3, that seems to be getting much better to the point where it is now worth trying stuff out again.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
| [reply] |
Re: faster with threads?
by Abigail-II (Bishop) on Jun 09, 2004 at 10:55 UTC
|
Also this script will run on a machine with 2 cpus. Will the threading enable it to use both cpu's?
Perhaps; this will depend on various things. It'll depend on the threading implementation on the OS that's being used, but it will also depend on the scheduling policy of your OS. If, for instance, your database is running on the same machine, its process might use enough CPU time to occupy a CPU on its own.
Threads may be an answer if you can split your CPU bound tasks into two. You only have one CPU bound task (the actual parsing), but that can easily be split in two: have each thread deal with half the file. However, instead of using two threads, it's much easier to use two processes (depending on your OS, you might even get more CPU time slots that way), and you save the not insignificant overhead Perl threads will give you. Now you might want to use separate threads to do your I/O as well. But that is something you should only consider if it's an actual bottleneck - and if it is, you can use separate processes as well, and have them communicate with pipes.
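One simple way to split the parsing across processes, sketched here with a round-robin stripe over line numbers (each child still reads the whole file, but only parses its share, which is fine when the parsing, not the read, is the bottleneck; `mung_logline` is assumed from earlier in the thread):

```perl
my $workers = 2;   # one per CPU
for my $w (0 .. $workers - 1) {
    my $pid = fork;
    die "fork failed: $!" unless defined $pid;
    next if $pid;                          # parent keeps forking

    open my $fh, '<', '/path/to/log.file' or die $!;
    while (<$fh>) {
        next unless $. % $workers == $w;   # this child's stripe of lines
        my $rec = mung_logline($_);
        # ... each child would use its own DB connection here ...
    }
    exit 0;
}
wait for 1 .. $workers;                    # reap the children
```

Splitting on byte offsets instead would avoid the duplicate read, at the cost of fiddlier seek-to-line-boundary logic.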
Abigail
| [reply] |
Re: faster with threads?
by paulbort (Hermit) on Jun 09, 2004 at 13:18 UTC
|
| [reply] |
|
There is a bulk insert for mysql (LOAD DATA INFILE), and I completely agree with you on the speed issue. Threading seems to buy nothing but pain in this case.
| [reply] |
Re: faster with threads?
by eserte (Deacon) on Jun 09, 2004 at 09:39 UTC
|
Also this script will run on a machine with 2 cpus. Will the threading enable it to use both cpu's?
This depends on the threading implementation of your operating system. I believe that on FreeBSD 4.x it won't (unless you use the linuxthreads package), while on Linux it will.
| [reply] |
|
Check your kernel: if you are using the SMP Linux kernel, there's a good chance it will use both processors. Also check whether your DB is set up to use threads.
| [reply] |
Re: faster with threads?
by dragonchild (Archbishop) on Jun 09, 2004 at 14:02 UTC
|
What I found is that RedHat9 will allocate one CPU to do computation and the other as a glorified IO manager. This is without threading. Personally, I would see if doing it without threading is fast enough. If it is, who cares if you can make it faster??
Remember - a requirement that most people forget to clarify is "How fast is fast enough?". The pursuit of perfection is harmful to good deliverables.
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
| [reply] |
Re: faster with threads?
by CountZero (Bishop) on Jun 09, 2004 at 13:15 UTC
|
If you really have such a large amount of data to insert, I would go the easy way and use the LOAD statement of MySQL. Then all your script has to do is put together the "load-file" and hand it over to the MySQL server, which will do all the heavy lifting (and faster than you can with multiple INSERT statements).
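A sketch of that approach, assuming a `$dbh` connected with `mysql_local_infile` enabled and a `mung_logline()` that returns the column values (file path and table name are made up):

```perl
# Build a tab-separated load file from the log...
open my $log, '<', '/path/to/log.file' or die $!;
open my $out, '>', '/tmp/load.tsv'     or die $!;
while (my $line = <$log>) {
    print $out join("\t", mung_logline($line)), "\n";
}
close $out;

# ...and let the server ingest it in one statement.
$dbh->do(q{LOAD DATA LOCAL INFILE '/tmp/load.tsv' INTO TABLE mytable});
```

The script does only cheap sequential I/O; all index maintenance happens server-side in one pass.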
CountZero "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
| [reply] |
Re: faster with threads?
by Ryszard (Priest) on Jun 09, 2004 at 12:27 UTC
|
| [reply] |
|
Use the mysql bulk insert statement format if possible. We found that using the traditional prepare/execute cycle was about 10 times slower to insert 2 million records than using several bulk SQL statements of about 500K each.
my @rows = (
    [qw(a b c d)],
    [qw(e f g h)],
    [qw(i j k l)],
    [qw(m n o p)],
    [qw(q r s t)],
    [qw(u v w x)],
    [qw(y z a b)],
    [qw(c d e f)],
    [qw(h i j k)],
    [qw(l m n o)],
);

# -------------------------------------
# slower: one round trip per row
# -------------------------------------
my $sth = $db->prepare("INSERT INTO mytable VALUES(?,?,?,?)")
    or die "prepare failed: " . $db->errstr;
for my $row ( @rows ) {
    $sth->execute(@$row);
}

# -------------------------------------
# faster: one multi-row INSERT
# -------------------------------------
my $sql = "INSERT INTO mytable VALUES ";
for my $row ( @rows ) {
    # use $db->quote() here if the data may contain quotes
    $sql .= "('" . join("','", @$row) . "'),";
}
substr($sql, -1) = '';    # drop the trailing comma
$db->do( $sql );
| [reply] [d/l] |
|
And if using a bulk insert isn't possible for some reason, first drop the indices on the table(s) involved, insert your data, then rebuild the indices. Unless the tables already contain billions of rows, rebuilding the indices is faster than modifying them millions of times.
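For MySQL's MyISAM tables there is even a shortcut that avoids dropping the indices by hand; `mytable` is a stand-in name:

```perl
# Suspend non-unique index maintenance for the duration of the load...
$dbh->do('ALTER TABLE mytable DISABLE KEYS');

# ... run the bulk INSERTs / LOAD DATA here ...

# ...then rebuild all the indexes in a single pass.
$dbh->do('ALTER TABLE mytable ENABLE KEYS');
```

Unique and primary-key indexes are still maintained during the load; only the non-unique ones are deferred.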
Abigail
| [reply] |
Re: faster with threads?
by cyocum (Curate) on Jun 09, 2004 at 11:03 UTC
|
Everyone's answers to your questions are great as usual. The only thing I can add is that the documentation for Perl's threading is in perlthrtut for later versions of perl (>= 5.6.0), and for earlier versions (5.005 <= version < 5.6.0) in perlothrtut, which I cannot seem to find on perldoc.com right now. I hope this helps!
| [reply] |
Re: faster with threads?
by bsb (Priest) on Jun 10, 2004 at 03:51 UTC
|
I don't see why the added complexity of threads is worth it in this case. Why does the speed matter?
You're better off keeping your code simple and leveraging OS scheduling, pipes and the database's bulk load utils. | [reply] |