Re: Big database queries

by dreadpiratepeter (Priest)
on Aug 30, 2002 at 00:49 UTC


in reply to Big database queries

A couple of thoughts:

1. Have you profiled the code? Is the time spent in the execute or in the iteration? (A rough timing sketch follows this list.)
2. If it is in the execute, are the tables properly indexed?
3. You don't say which DB you are using; some have statistics-gathering agents that optimize queries. For example, Informix allows you to update statistics on tables and columns. I've seen this process cut query time from the heat death of the universe down to 3-5 seconds for a complex join or view.
4. Most commercial DBs have an explain plan feature, which tells you how the query optimizer will try to return data quickly, and which can be used for tuning.
5. If it is the iteration that is taking the time, I would suggest using the Perl profiler to optimize the code in the loop.
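
For instance, a minimal timing sketch (DSN, credentials, table and column names are all made up here) that separates the execute from the fetch loop with Time::HiRes:

use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical connection and query; adjust DSN, credentials and SQL.
my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT col1, col2 FROM big_table WHERE col2 > 2500');

my $t0 = [gettimeofday];
$sth->execute;
printf "execute: %.2fs\n", tv_interval($t0);

$t0 = [gettimeofday];
my $rows = 0;
while ( my $row = $sth->fetchrow_arrayref ) {
    $rows++;    # do the real per-row work here
}
printf "fetch loop (%d rows): %.2fs\n", $rows, tv_interval($t0);

Some drivers defer part of the work until the first fetch, so treat the split as a rough guide rather than an exact breakdown.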

Hope some of this helps,

-pete
"Pain heals. Chicks dig scars. Glory lasts forever."

Replies are listed 'Best First'.
Re: Re: Big database queries
by ezekiel (Pilgrim) on Aug 30, 2002 at 03:58 UTC

    Thanks for your response. In answer to your questions:

    1. Yes. The time is in the execution, not the iteration.
    2. Yes.
    3, 4. I am using DB2.

    It seems I will have to spend more time looking at things in the database. This is unfortunate; I like to stay on the Perl side of DBI :-)

      Staying as much in Perl-space as possible is a good idea. One of the in-house apps here had all or most of its business logic implemented in stored procedures. Then the volume increased so much that they needed a better database server. They are still coding away after all this time; several teams from the DB vendor have tried, but not succeeded.

      Anyhow,

      UPDATE table_name SET column1 = column1 + 3 WHERE column2 > 2500
      might do the trick. It will still be gruelling, though: 60 million transactions is a lot, whichever way you look at it.
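
      A sketch of what that looks like from the Perl side (hypothetical DSN, table and column names): one set-based statement through DBI's do keeps all the work in the database, instead of fetching and updating millions of rows one at a time:

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass',
                             { RaiseError => 1, AutoCommit => 0 });

      # One set-based statement: the database does all the work server-side.
      my $rows = $dbh->do(
          'UPDATE table_name SET column1 = column1 + 3 WHERE column2 > 2500'
      );
      $dbh->commit;
      print "updated $rows rows\n";

      # The slow alternative (a SELECT, a fetch loop, and one UPDATE per row)
      # would mean millions of round trips between Perl and the database.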

      I've also had great success with introducing new indexes. It's incredible what a difference a well-placed index can make when you are dealing with large data sets. Have you tried running your queries through explain?
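
      For example (again with hypothetical names), an index on the column used in the WHERE clause above is a one-liner, and re-running explain afterwards shows whether the optimizer actually picks it up:

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass', { RaiseError => 1 });

      # A well-placed index on the filter column; then re-check the access plan.
      $dbh->do('CREATE INDEX idx_column2 ON table_name (column2)');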

        Note: This might slip a bit off-topic and contain some prose (which might still be interesting); the advice is "hidden" at the end...

        IMHO, the problem with stored procedures vs. Perl is that people quite often mix up the applications of the two.

        A stored procedure is meant to enhance a DBMS, traditionally equipped with a descriptive interface (like SQL and its DML), with a procedural interface (like PL/SQL).

        In a descriptive language like SQL you describe a result set. The better (or cleverer) your description, the better (and faster) you'll get your answer. You tell the processor WHAT you want, but not HOW to achieve it.

        In a procedural language (like PL/SQL, Perl, C, and most other programming languages we all know well), you describe an algorithm: you give instructions on HOW to achieve a certain "goal", but the processor has no idea WHAT you are leading it to.
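
        A small illustration of the difference (hypothetical table and column names): the descriptive version states WHAT is wanted and lets the optimizer choose the plan, while the procedural version spells out the HOW step by step in Perl:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass', { RaiseError => 1 });

        # Descriptive: describe the result set, let the optimizer pick the HOW.
        my ($count) = $dbh->selectrow_array(
            'SELECT COUNT(*) FROM orders WHERE amount > 1000'
        );

        # Procedural: spell out the steps yourself. Same answer, but the HOW
        # is fixed by the code and usually means shipping every row to Perl.
        my $sth = $dbh->prepare('SELECT amount FROM orders');
        $sth->execute;
        my $slow_count = 0;
        while ( my ($amount) = $sth->fetchrow_array ) {
            $slow_count++ if $amount > 1000;
        }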

        Now, folks like Oracle, IBM & Co. are usually quite good at optimizing their descriptive processors. Still, when you ask a stupid question, you'll get a stupid answer (no pun intended). When you ask a complicated question, the answer will take longer...

        Now some people think they are smarter than those database vendors, and start hacking away in stored procedures at things that could be perfectly well described by well-formulated SQL statements. Most of the time they're actually reproducing indexing or caching algorithms already available in the DBMS. Most performance problems on large databases can be solved by applying better indexing. So what are stored procedures good for? They're perfect for trigger procedures, maintenance procedures and database statistics: work in the realm of the DB department, not of the application development teams.

        And they're good for data optimization procedures (I hope you're silently asking yourself "what?" now ;).

        Being the user of a database I don't control myself, I know how frustrating it can be to cope with poor database design by the people who simply have to store some 60 million records (which is easy most of the time, especially if those 60M records don't have to be stored fast or all at once). I had to select just a few records from a large hierarchical table; the SQL statement was two pages long and packed with ugly things like NOT IN clauses and three-level subqueries. We simply lacked the right views.

        After fighting the "storage client guys" for almost a year, and lots of discussion, I proved to them that their design was crap. And I made sure to have the right database folks attending the meeting to back me up.

        The table was a slow mover: changes occurred seldom in relation to my queries, which were fired against that table rapidly. So I issued an ultimatum to the other team: either change the design of the table itself, or create materialized views for me. They chose the latter, since they didn't want to go through all their "precious" client code. A materialized view is a table that's fed with reformatted or regenerated data based on other database tables (thus the name view). Unlike a normal view, this data is not generated dynamically at query time, but is kept permanently, materialized.

        We simply created a stored procedure that did my preprocessing ahead of time (flattening out the hierarchy, combining some flags into one, stripping out some abstraction that was relevant to other applications but not to mine, etc.) up to the point where my Perl (CGI) script could ideally jump in (and the definition of that point is crucial), and wrote the results to another table. The procedure was automatically triggered whenever a client updated, deleted or inserted records in the source table, so the view was always up to date.

        Had we chosen a clever procedure triggered by the querying client instead, we might have gained a bit, but it's better to run such a procedure as seldom as possible, only when the data changes, of course sacrificing some hard disk space for preprocessed data. Doing it on the store side also allows further optimization: at store time you still know what you're changing, so you only need to update the records affected by your change, not the whole materialized view. Note also that this is perfectly transaction-safe (assuming your triggers delay the commit of the source changes until the views are updated).
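
        As a rough sketch of what the querying side ends up looking like (hypothetical table and column names; the store-side trigger and procedure are assumed to exist and to keep the flat table current), the rapid-fire script is reduced to a trivial query against the preprocessed table:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:DB2:mydb', 'user', 'pass', { RaiseError => 1 });

        # The store-side trigger keeps flat_hierarchy up to date, so the read
        # side is a cheap, index-friendly SELECT instead of a two-page query
        # full of NOT INs and nested subqueries.
        my $rows = $dbh->selectall_arrayref(
            'SELECT id, path, combined_flags FROM flat_hierarchy WHERE path LIKE ?',
            { Slice => {} },
            'root/branch/%',
        );

        for my $row (@$rows) {
            print "$row->{id}: $row->{path}\n";
        }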

        To happy-end my story: when the procedures were set up, performance on the store side of the database was, as expected, a bit worse due to the procedure overhead, but the performance of my rapid-fire CGI script made several quantum leaps. Finally I was in the fast lane. Needless to say, it felt good to be proven right. ;)

        So the essence of my advice is this: shift the complicated processing to the side of the I/O cycle that moves more slowly, or where it does the least harm (which for most applications is the input/store side). In my example at least, it was less trouble to let the few people who fed the tables wait a bit longer than the thousands of people who queried them. The load impact of that "bit" is, after all, determined by how often it is applied...

        So long,
        Flexx
