Re: Big database queries

by dreadpiratepeter (Priest)
on Aug 30, 2002 at 00:49 UTC


in reply to Big database queries

A couple of thoughts:

1. Have you profiled the code? Is the time spent in the execute or in the iteration? (A rough timing sketch follows this list.)
2. If it is in the execute, are the tables properly indexed?
3. You don't say which DB you are using; some have statistics-gathering agents that optimize queries. For example, Informix allows you to update statistics on tables and columns. I've seen this process cut query time from the heat death of the universe down to 3-5 seconds for a complex join or view.
4. Most commercial DBs have an explain plan feature, which tells you how the query optimizer will try to return data quickly, and which can be used for tuning.
5. If it is the iteration that is taking the time, I would suggest using the Perl profiler to optimize the code in the loop.
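
For instance, a minimal timing sketch (DSN, credentials, table and column names are all made up here) that separates the execute from the fetch loop with Time::HiRes:

use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical connection and query; adjust DSN, credentials and SQL.
my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT col1, col2 FROM big_table WHERE col2 > 2500');

my $t0 = [gettimeofday];
$sth->execute;
printf "execute: %.2fs\n", tv_interval($t0);

$t0 = [gettimeofday];
my $rows = 0;
while ( my $row = $sth->fetchrow_arrayref ) {
    $rows++;    # do the real per-row work here
}
printf "fetch loop (%d rows): %.2fs\n", $rows, tv_interval($t0);

Some drivers defer part of the work until the first fetch, so treat the split as a rough guide rather than an exact breakdown.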

Hope some of this helps,

-pete
"Pain heals. Chicks dig scars. Glory lasts forever."

Replies are listed 'Best First'.
Re: Re: Big database queries
by ezekiel (Pilgrim) on Aug 30, 2002 at 03:58 UTC

    Thanks for your response. In answer to your questions:

    1. Yes. The time is in the execution, not the iteration.
    2. Yes.
    3, 4. I am using DB2.

    It seems I will have to spend more time looking at things in the database. This is unfortunate; I like to stay on the Perl side of DBI :-)

      Staying as much in Perl-space as possible is a good idea. One of the in-house apps here had all or most of its business logic implemented in stored procedures. Then the volume increased so much that they needed a better database server. They are still coding away after all this time; several teams from the DB vendor have tried, but not succeeded.

      Anyhow,

      UPDATE table_name SET column1 = column1 + 3 WHERE column2 > 2500
      might do the trick. It will still be gruelling, though: 60 million transactions is a lot, whichever way you look at it.
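
      A sketch of what that looks like from the Perl side (hypothetical DSN, table and column names): one set-based statement through DBI's do keeps all the work in the database, instead of fetching and updating millions of rows one at a time:

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass',
                             { RaiseError => 1, AutoCommit => 0 });

      # One set-based statement: the database does all the work server-side.
      my $rows = $dbh->do(
          'UPDATE table_name SET column1 = column1 + 3 WHERE column2 > 2500'
      );
      $dbh->commit;
      print "updated $rows rows\n";

      # The slow alternative (a SELECT, a fetch loop, and one UPDATE per row)
      # would mean millions of round trips between Perl and the database.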

      I've also had great success with introducing new indexes. It's incredible what a difference a well-placed index can make when you are dealing with large data sets. Have you tried running your queries through explain?
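
      For example (again with hypothetical names), an index on the column used in the WHERE clause above is a one-liner, and re-running explain afterwards shows whether the optimizer actually picks it up:

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass', { RaiseError => 1 });

      # A well-placed index on the filter column; then re-check the access plan.
      $dbh->do('CREATE INDEX idx_column2 ON table_name (column2)');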

        Note: This might slip a bit off-topic and contain some prose (which might still be interesting); the advice is "hidden" at the end...

        IMHO, the problem with stored procedures vs. Perl is that people quite often mix up the applications of the two.

        A stored procedure is meant to enhance a DBMS, traditionally equipped with a descriptive interface (like SQL and its DML), with a procedural interface (like PL/SQL).

        In a descriptive language like SQL you describe a result set. The better (or cleverer) your description, the better (and faster) you'll get your answer. You tell the processor WHAT you want, but not HOW to achieve it.

        In a procedural language (like PL/SQL, Perl, C, and most other programming languages we all know well), you describe an algorithm: you give instructions on HOW to achieve a certain "goal", but the processor has no idea WHAT you are leading it to.
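
        A small illustration of the difference (hypothetical table and column names): the descriptive version states WHAT is wanted and lets the optimizer choose the plan, while the procedural version spells out the HOW step by step in Perl:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass', { RaiseError => 1 });

        # Descriptive: describe the result set, let the optimizer pick the HOW.
        my ($count) = $dbh->selectrow_array(
            'SELECT COUNT(*) FROM orders WHERE amount > 1000'
        );

        # Procedural: spell out the steps yourself. Same answer, but the HOW
        # is fixed by the code and usually means shipping every row to Perl.
        my $sth = $dbh->prepare('SELECT amount FROM orders');
        $sth->execute;
        my $slow_count = 0;
        while ( my ($amount) = $sth->fetchrow_array ) {
            $slow_count++ if $amount > 1000;
        }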

        Now, folks like Oracle, IBM & Co. are usually quite good at optimizing their descriptive processors. Still, when you ask a stupid question, you'll get a stupid answer (no pun intended). When you ask a complicated question, the answer will take longer...

        Now some people think they are smarter than those database vendors, and start hacking away in stored procedures at things that could be perfectly well described by well-formulated SQL statements. Most of the time they're actually reproducing indexing or caching algorithms already available in the DBMS. Most performance problems on large databases can be solved by applying better indexing. So what are stored procedures good for? They're perfect for trigger procedures, maintenance procedures and database statistics: work in the realm of the DB department, not of the application development teams.

        And they're good for data optimization procedures (I hope you're silently asking yourself "what?" now ;).

        Being the user of a database I don't control myself, I know how frustrating it can be to cope with poor database design by the people who simply have to store some 60 million records (which is easy most of the time, especially if those 60M records don't have to be stored fast or all at once). I had to select just a few records from a large hierarchical table; the SQL statement was two pages long and packed with ugly things like NOT IN clauses and three-level subqueries. We simply lacked the right views.

        After fighting the "storage client guys" for almost a year, and lots of discussion, I proved to them that their design was crap. And I made sure to have the right database folks attending the meeting to back me up.

        The table was a slow mover: changes occurred seldom in relation to my queries, which were fired against that table rapidly. So I issued an ultimatum to the other team: either change the design of the table itself, or create materialized views for me. They chose the latter, since they didn't want to go through all their "precious" client code. A materialized view is a table that's fed with reformatted or regenerated data based on other database tables (thus the name view). Unlike a normal view, this data is not generated dynamically at query time, but is kept permanently, materialized.

        We simply created a stored procedure that did my preprocessing ahead of time (flattening out the hierarchy, combining some flags into one, stripping out some abstraction that was relevant to other applications but not to mine, etc.) up to the point where my Perl (CGI) script could ideally jump in (and the definition of that point is crucial), and wrote the results to another table. The procedure was automatically triggered whenever a client updated, deleted or inserted records in the source table, so the view was always up to date.

        Had we chosen a clever procedure triggered by the querying client instead, we might have gained a bit, but it's better to run such a procedure as seldom as possible, only when the data changes, of course sacrificing some hard disk space for preprocessed data. Doing it on the store side also allows further optimization: at store time you still know what you're changing, so you only need to update the records affected by your change, not the whole materialized view. Note also that this is perfectly transaction-safe (assuming your triggers delay the commit of the source changes until the views are updated).
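
        As a rough sketch of what the querying side ends up looking like (hypothetical table and column names; the store-side trigger and procedure are assumed to exist and to keep the flat table current), the rapid-fire script is reduced to a trivial query against the preprocessed table:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass', { RaiseError => 1 });

        # The store-side trigger keeps flat_hierarchy up to date, so the read
        # side is a cheap, index-friendly SELECT instead of a two-page query
        # full of NOT INs and nested subqueries.
        my $rows = $dbh->selectall_arrayref(
            'SELECT id, path, combined_flags FROM flat_hierarchy WHERE path LIKE ?',
            { Slice => {} },
            'root/branch/%',
        );

        for my $row (@$rows) {
            print "$row->{id}: $row->{path}\n";
        }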

        To happy-end my story: when the procedures were set up, performance on the store side of the database was, as expected, a bit worse due to the procedure overhead, but the performance of my rapid-fire CGI script made several quantum leaps. Finally I was in the fast lane. Needless to say, it felt good to be proven right. ;)

        So the essence of my advice is this: shift the complicated processing to the side of the I/O cycle that moves more slowly, or where it does the least harm (which for most applications is the input/store side). In my example at least, it was less trouble to let the few people who fed the tables wait a bit longer than the thousands of people who queried them. The load impact of that "bit" is, after all, determined by how often it is applied...

        So long,
        Flexx
