Comparing tables over multiple servers

mnlight has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Comparing tables over multiple servers by Tanktalus (Canon) on Oct 27, 2005 at 14:55 UTC
To be honest, I wouldn't do this in perl. I would do this in the database. You could probably federate the bunch of them and do some sort of SELECT COUNT(*) FROM T1 WHERE PRIMKEY NOT IN (SELECT PRIMKEY FROM T2, T3, T4, T5, T6, T7) or something. But that's probably not really fast either. Instead, invest in replication technology if your vendor doesn't come with any by default. (I know DB2 has replication technology, but I think it's an add-on cost.) And get that vendor to show you how to be convinced that everything is synchronised. If you're doing this for high-availability (doesn't sound like it - 6 fail-over servers seems a bit high), then invest in HA data replication (HADR) and set it up - and, again, ask your vendor to prove it's working (or at least to show you how to prove it's working). I'm sure you could do this in perl. But I wouldn't. And I use perl for almost everything - just ask my teammates.	[reply]
Re: Comparing tables over multiple servers by BrowserUk (Patriarch) on Oct 27, 2005 at 16:18 UTC
Rather than thinking of this as an any to any comparison with the exponential growth that can result,designate one machine as the 'master' and have it send it's table to the other 6 (n) machines, that simply update/replace their copy from the master. No comparison necessary. A straight delete and replace of the table is going to be much quicker than a (multi-way) comparison. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^2: Comparing tables over multiple servers by rnahi (Curate) on Oct 27, 2005 at 19:48 UTC
Would you still recommend this solution if the table size were, say, 100 GB? Think about locking all seven tables until copies are delivered. (If you don't, the table you pass to the slaves may be different from the one that is being modified in the master copy.)	[reply]
Re^3: Comparing tables over multiple servers by BrowserUk (Patriarch) on Oct 27, 2005 at 20:18 UTC
Hmm. If the tables are that large, then you have a problem anyway. No matter how you arrange to do the comparison, it still requires that at least one copy of the table be tranmitted between each pair of machines in order for the comparison to take place. Having transmitted that data, a blanket replacement will always be quicker than a compare. The time spent locked whilst transmitting remains a constant and the time spent locked replacing will pale into insignificance relative to the time spent comparing. If the tables are this large and dynamic (from what the OP said, his seemed to be relatively small & static), then you would obviously need to use some kind of dynamic replication or true distributed updates, but that is quite a different situation and a considerably more involved undertaking than sync'ing a few, small, relatively static reference tables, which was my interpretation of the OP situation. Maintaining multiple copies of large dynamic tables in a distributed DB environment is the current holy grail of RDMBS development, still a proprietary black art for the most part, and hardly the sane subject of a "quick perl script" to undertake. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^4: Comparing tables over multiple servers by rnahi (Curate) on Oct 27, 2005 at 20:24 UTC
Re^5: Comparing tables over multiple servers by BrowserUk (Patriarch) on Oct 27, 2005 at 20:49 UTC
Some notes below your chosen depth have not been shown here
Re: Comparing tables over multiple servers by rnahi (Curate) on Oct 27, 2005 at 16:36 UTC
It depends on the DBMS you are using, but you can get some ideas from Remote MySQL table comparison. It can get you started on the general principles.	[reply]
Re: Comparing tables over multiple servers by Anonymous Monk on Oct 27, 2005 at 20:18 UTC
You're not far off, BrowserUK and mahi's answers to your Q sounds good. I'd like to try to give you an answer but have to ask a Q of my own, -do you have what would be a master DB against which you would compare all others ? Or do you simply wish to find the differences between them and then merge all ? Another Q, if you have differences, what would you like to do with the different data fields, dump them, store them, or intergrate them ??? Or maybe you've already figured this out!	[reply]
Re: Comparing tables over multiple servers by aquarium (Curate) on Oct 27, 2005 at 14:56 UTC
are these html tables?...same number rows/columns?..are you wanting also to check that the html markup is same? give us something. why does it seem like it will take a long time?..have u attempted any code? the hardest line to type correctly is: stty erase ^H	[reply]
Re^2: Comparing tables over multiple servers by mnlight (Scribe) on Oct 27, 2005 at 15:26 UTC
No I have not attempted any code. The issue just came up late yesterday. So I have been contemplating the best way of doing it. I am working on it today. All I want to do is compare the data row by row. The tables are all the same. They are system tables in sybase. We had some issues arise with our roles and want to verify that we are synchronized accross ENV. Each instance of sybase is running on it's own box. I would like this to be something we could use in the future. I guess I see it taking a long time because I would have to build an array for each table and then compare each array to the other e.g. a1 compares to a2, a3, a4, a5, a6, a7 a2 compares to a3, a4, a5, a6, a7 ... If in the future we want to do this with some larger tables it might be really tedious. Just some ideas not looking for new applications just something quick that might have some future value.	[reply]
Re^3: Comparing tables over multiple servers by philcrow (Priest) on Oct 27, 2005 at 15:45 UTC
That comparison list reminds me of a proof technique (from when I once studied/taught math). I think you can compare in a ring. That is compare a1 to a2, then a2 to a3, a3 to a4, ... keep going until you compare a7 to a1. Each step reports any differences. If there are none, everything is good. Only seven comparisons are needed. Phil	[reply]
Re^3: Comparing tables over multiple servers by mpeppler (Vicar) on Oct 27, 2005 at 18:43 UTC
I like BrowserUK's solution. Alternatively - as these are system tables and are fairly small, I'd extract the data (via perl, isql, bcp) making sure that each table is always sorted in the same way, and then use plain old diff to get an idea of what is incorrect. If the problem lies in incorrect role definitions then the number of tables that need to be checked is fairly limited, so doing it this way is reasonable, and can be completed fairly quickly. Michael	[reply]
Re: Comparing tables over multiple servers by Tuppence (Pilgrim) on Oct 28, 2005 at 21:03 UTC
I'm really surprised that noone has mentioned checksumming for this operation. While it would take a lot of processing, I'm pretty sure that any technique to compare tables is going to take a lot of processing power. For some large tables, in memory comparison of actual values will not work at all, as the working set (your entire table size * number of tables to be compared) will not fit into ram + swap. Checksumming will help by providing a unique (or nearly so) value for incoming data, while not having to store the data. For instance, you can generate a MD5 sum of a given string, and it will return 32 characters. If you join the fields of your table together, md5sum that, and compare it to the MD5sums from the other tables, you can find out if they are different. Of course, once you have determined that the rows are different, you will have to reprocess them to find out HOW they are different. This can even be extended further, by MD5ing multiple rows at the same time - which will increase your reprocressing time to find out HOW and WHICH ROW are different, but will further decrease the memory requirements. Taken to obscene levels, you could conceivably make one MD5sum per table, and compare those. If they are the same, the tables are the same. If they are different, the tables are different - but then of course you still need to do more work to find out WHERE the tables are different	[reply]
Re: Comparing tables over multiple servers by RiotTown (Scribe) on Oct 31, 2005 at 05:55 UTC
I can see a couple of different solutions, the best one would be have to be determined by you. Firstly, I'm assuming that at least 1 of these tables would have to be the master table that all data would be referenced against. Secondly, my solutions would depend on the size of the table. Thirdly, being a system table (IMHO) it would probably make more sense to have these scripts notify a DBA, but that is a separate conversation entirely. (You may also want to seriously consider removing all non-select access to system tables except for 1 to 2 DBA types to eliminate this from happening in the future) *also worth noting: With each of these solutions, if your master table gets horked, it cascades down... 1. If the tables are small: Write a script that opens up connections to the 7 tables. Then do a 'select stuff from master table'. While looping through the results, make select calls out to the remaining 6 tables and compare the results (most likely based on primary key). If something is out of wack send notice or fix it. This could also be done on each of the slave dbs making a connection to the master as explained below, just using perl instead of db actions. 2. If the tables are large/very large: Write a script that runs (daily/hourly/etc) that connects from each of the 'slave' databases. This script would connect to the master, then copy the contents of the master system table into a local table on the slave db. You _should_ (this is based on knowledge of oracle and 2 seconds of googling sybase) then be able to create a synonym of the new table to be referenced as the system table (you may also have to drop the system table first). You'd also have to make sure to enable all of the same grants that are in place on the master table before you create the synonym. Hope this helps.	[reply]
Re^2: Comparing tables over multiple servers by mpeppler (Vicar) on Oct 31, 2005 at 06:10 UTC
You may also want to seriously consider removing all non-select access to system tables except for 1 to 2 DBA types to eliminate this from happening in the future You need to have sa_role to be able to write to these tables, and you usually update the tables via system stored procedures. So your (good) suggestion is actually already implemented. In addition, updating the tables directly requires setting a special "allow update to system tables" flag. Michael	[reply]
Re: Comparing tables over multiple servers by hakkr (Chaplain) on Oct 28, 2005 at 14:51 UTC
If you can find them you could just stat the modified date and size of the raw datafiles or compare md5 hashes of the files. Depending on how you can sort the data with sql you may just be able to take a look at the first x rows and/or the last x rows of the table instead of building a complete array. Also depending on your data you could maybe checksum a column.	[reply]


The stupid question is the question not asked
	PerlMonks