Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Re^4: Comparing tables over multiple servers

by rnahi (Curate)
on Oct 27, 2005 at 20:24 UTC ( [id://503465]=note: print w/replies, xml ) Need Help??


in reply to Re^3: Comparing tables over multiple servers
in thread Comparing tables over multiple servers

it still requires that at least one copy of the table be tranmitted between each pair of machines in order for the comparison to take place

Not necessarily. As this solution suggests, you can calculate a global CRC locally on each server, and then compare the results.

  • Comment on Re^4: Comparing tables over multiple servers

Replies are listed 'Best First'.
Re^5: Comparing tables over multiple servers
by BrowserUk (Patriarch) on Oct 27, 2005 at 20:49 UTC

    Okay. You got me there. Though I do have to wonder about the time spent calculating the CRCs (using SQL) relative to transmitting the data. If the machines are connected by a high-speed (10Mbits/s or greater) network connection, I would think it would be a close run thing.

    It would be quicker if you could CRC the underlying filesystem entities in which the data is stored, but of course, any difference in the transaction histories, database configurations or versions, or even the disk drives would thrown the comparison out.

    And, once you have discovered that there is a difference, you have still to corrected it, and you're right back where you started from needing to transmit the data.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Just think about this: to move a table across the network you can either (a) make a physical copy, and this includes data, indexes, triggers, transaction caches, and whatever the DBMS keeps in store for that table, or (b) copy just the data as SQL statements or as exportable text, but then the receiving DBMS should rebuild the indexes.

      Instead, to calculate a CRC you have just a table scan.

      Therefore you should compare the time needed to calculate a SHA twice per record (one for the record and one to apply it to a global SHA) versus the time needed to pass all the above mentioned stuff over the network.

      I put my bet on SHA. :)

        You would? Not me.

        Did you take a good look at all the data conversions and substrings and stuff going on in that SQL? SQL can be pretty optimal at performing comparison, that's its bread and butter work, but those types of data manipulations and conversions are not it's strong suite.

        I attempted to verify my suspicions, but about half of the syntax in that article doesn't seem to be valid with the only SQL database I have available, but I'm betting (a coffee:) that it ain't quick on any platform.

        I would hazzard that dumping the table using the export facilty and using a dedicated binary digest(or) program would be considerably faster.

        Either way, once the determination of difference is made, you have still to correct it and that means transmitting the data. Easier, surer and possibly quicker to just zip up the dumped table and send it I think.

        Unless the data involved is already compressed binary--jpgs or similar--then the 100GB would probably reduce to 25% or so, and transmitting 25GB at 100Mb/s will take 34 minutes, assuming no contention.

        Running a dedicated md5 executable on 1GB takes around 20 seconds, so around 1/2 hour for 100GB, but that is calculating a single hash from a contiguous datastream.

        You're suggesting calculating 2 hashes for every piece of data, retrieved in iddy biddy chunks and doing all the math in SQL?

        In the absence of evidence to the contrary, my money would be on the transmission finishing long before the checksumming.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://503465]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (6)
As of 2024-04-18 18:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found