mysql's join too slow; using Perl to compare two tables

by Andre_br (Pilgrim)
on Feb 04, 2006 at 22:25 UTC

Andre_br has asked for the wisdom of the Perl Monks concerning the following question:

Hello esteemed fellow monks

I hope you guys can help me on this. Here's the problem: I have a table named 'catalog'; that's where the products are kept. It's something like this:

table 'catalog'
product_id | author | title | label | description | price | vendor_id | log
From time to time, my vendors send me a .txt file with their updated inventory, which has the following columns:
author | title | label | description | price

So that's when I have to check which CDs are new in their inventory, you know? I could just erase their entire online inventory and then record the new one they've just sent, but this isn't good, because it's important that the product_ids of the veteran products don't change, you know?

Then what do I do? I insert the new data into a table I call 'pivot', which has the same fields as the ones they send me, plus a received_id:

table 'pivot'
received_id | author | title | label | description | price
Then I perform the inner join below to find the received_ids of the products that are in fact veteran (those rows get erased from the 'pivot' table before the remaining ones are inserted into the catalog, so only new products end up there).
select pivot.received_id
from catalog
inner join pivot on (
    catalog.author = pivot.author
    and catalog.titulo = pivot.titulo
    and catalog.label = pivot.label
    and catalog.description = pivot.description
    and catalog.price = pivot.price
);
The problem is that this inner join with multiple ON conditions takes too much effort and time - we're talking about a 100k-record inner join! I was looking for another way to compare these two tables without this inner join, but it occurs to me now that maybe I could use Perl instead of MySQL.

I could, instead of using this 'pivot' table, load the 'catalog' data into an array - each element being a concatenation like ('thisauthor thistitle thislabel thisdescription thisprice', 'thisanotherauthor thisanothertitle thisanotherlabel thisanotherdescription thisanotherprice', ...) and so on. Then I could do the same with the data the vendor sends me, and find the redundant ones with grep. Well, I don't know if this will sound like heresy, sorry if it does.
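Something like this is what I have in mind - just a sketch, where the DBI handle $dbh and the tab-delimited 'vendor.txt' file are assumptions, and I use a hash instead of grep so each lookup is O(1) instead of a rescan of the whole list:

    # Build one hash key per existing catalog record.
    my %in_catalog;
    my $sth = $dbh->prepare(
        'SELECT author, title, label, description, price FROM catalog');
    $sth->execute;
    while ( my @row = $sth->fetchrow_array ) {
        $in_catalog{ join "\t", @row } = 1;
    }

    # Any vendor line not in the hash is a genuinely new product.
    open my $fh, '<', 'vendor.txt' or die "vendor.txt: $!";
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $in_catalog{$line};    # veteran product, skip it
        # ... INSERT this new product into catalog here
    }
    close $fh;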

What do you think?

Thanks a lot!

Andre

Re: mysql's join too slow; using Perl to compare two tables
by reneeb (Chaplain) on Feb 04, 2006 at 23:07 UTC
    why not
    use Data::Dumper;

    my $st_select = q~SELECT product_id FROM catalog
                      WHERE author = ? AND title = ? AND label = ?
                        AND description = ? AND price = ?~;
    my $sth = $dbh->prepare($st_select) or die $dbh->errstr();
    for my $arrayref (@data_from_txt) {
        $sth->execute(@$arrayref) or die $dbh->errstr();
        unless ( $sth->fetchrow_array() ) {
            print "This is not in catalog: ", Dumper($arrayref);
        }
    }

    @data_from_txt holds all the information from the txt file, with an arrayref for each dataset...
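    For example, @data_from_txt could be filled from a tab-delimited file like this (the file name and the delimiter are just assumptions):

        # One arrayref of (author, title, label, description, price) per line.
        my @data_from_txt;
        open my $txt, '<', 'vendor.txt' or die "vendor.txt: $!";
        while ( my $line = <$txt> ) {
            chomp $line;
            push @data_from_txt, [ split /\t/, $line, -1 ];
        }
        close $txt;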
Re: mysql's join too slow; using Perl to compare two tables
by monarch (Priest) on Feb 04, 2006 at 22:59 UTC
    I have personally found myself somewhat disgusted with MySQL performance when it comes to simple joins. I don't know if this is because I didn't index the tables properly or because of some inherent disability in MySQL.

    Postgres, on the other hand, seems to take the same time to do inner, outer, or whatever style joins with no performance penalty.

    So, no, it's not wrong to scan each table one at a time and get Perl to do the intelligent work instead of MySQL.

      I don't know if this is because I didn't index the tables properly...

      That's probably it. Proper indexing is essential with any non-trivial amount of data.

      Postgres, on the other hand, seems to take the same time to do inner, outer, or whatever style joins with no performance penalty.

      ... Unless you don't have the necessary columns indexed, in which case performance can appear to suck mightily, particularly once you've left a small, tidy test environment and start dealing with huge piles of customer data.

Re: mysql's join too slow; using Perl to compare two tables
by srdst13 (Pilgrim) on Feb 05, 2006 at 01:51 UTC

    Probably the simplest way to do this is to do a select using your "new" data minus any information that might have changed (like the price). Then, if you get a result back, do an update. If not, then do an insert. Pretty simple, I think.
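    A rough sketch of what I mean - column names follow the original post, everything else is illustrative:

        # Look the record up by the fields that don't change...
        my $find = $dbh->prepare(q{
            SELECT product_id FROM catalog
            WHERE author = ? AND title = ? AND label = ? AND description = ?
        });
        my $update = $dbh->prepare(
            'UPDATE catalog SET price = ? WHERE product_id = ?');
        my $insert = $dbh->prepare(q{
            INSERT INTO catalog (author, title, label, description, price)
            VALUES (?, ?, ?, ?, ?)
        });

        for my $rec (@data_from_txt) {
            my ($author, $title, $label, $desc, $price) = @$rec;
            $find->execute($author, $title, $label, $desc);
            if ( my ($id) = $find->fetchrow_array ) {
                $update->execute($price, $id);    # veteran: keep its product_id
            }
            else {
                $insert->execute(@$rec);          # genuinely new product
            }
        }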

    As for the speed issue, I'm fairly sure that this is an indexing issue. I routinely do inner joins of tables with 60 million rows in mysql on a web-based app. For best performance, any column that appears in a where clause or a join clause should be indexed, in general.

    Sean
Re: mysql's join too slow; using Perl to compare two tables
by converter (Priest) on Feb 05, 2006 at 15:27 UTC

    Are you using indices, and if so, are you certain that the database is doing index scans instead of sequential scans in your query? Before you give up on MySQL and try to solve the problem with Perl, you should have a look at the EXPLAIN and ANALYZE statements.

    Since you didn't mention which version of MySQL you're using, I'll take a guess and include Optimizing Queries with EXPLAIN for versions 3.23 through 4.1.
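    For instance, something like this from DBI - just a sketch, with $join_sql standing in for the slow query; a NULL 'key' next to a six-figure 'rows' estimate means a full table scan:

        # Run EXPLAIN on the query and print the interesting plan columns.
        my $plan = $dbh->selectall_arrayref( "EXPLAIN $join_sql", { Slice => {} } );
        for my $step (@$plan) {
            printf "table=%s type=%s key=%s rows=%s\n",
                $step->{table},
                $step->{type},
                ( defined $step->{key} ? $step->{key} : 'NULL' ),
                $step->{rows};
        }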

Re: mysql's join too slow; using Perl to compare two tables
by JamesNC (Chaplain) on Feb 05, 2006 at 16:53 UTC
    I have to agree with friedo. This sounds like an index issue. What kind of indexes do you have on catalog and pivot? If you don't know what I am talking about, then you need to educate yourself on indexes.
    Here is how you would create some indexes on these two tables. Enter these in the mysql command line tool.
    create index c_atldp on catalog ( author, titulo, label, description, price );
    create index p_atldp on pivot ( author, titulo, label, description, price );

    You can create several different types of indexes depending on how and what kind of queries you are writing and a full discussion of indexes is OT. Sometimes you need to write new ones for a new type of query.
    JamesNC
      Ok! I've implemented the indexes and things are now warp speed! MySQL is great, I was the one screwing up... laughs. It was also because I was forcing a conversion to binary on all the fields, to make sure different cases weren't ignored when comparing similar records:

      select pivot.received_id
      from catalog
      inner join pivot on (
          binary catalog.author = binary pivot.author
          and binary catalog.titulo = binary pivot.titulo
          and binary catalog.label = binary pivot.label
          and binary catalog.description = binary pivot.description
          and binary catalog.price = binary pivot.price
      );

      I'd put the indexes in and things didn't get any better. Then I decided to be less of a perfectionist and took away all the binaries... OK, MySQL, no problem, this isn't really thaaaat necessary for me. Well, then the join that had been taking 3 minutes surprised me with the output screen in, guess what, 1 second!
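      For the record, if I ever need the case-sensitive matching back, one alternative (assuming MySQL 4.1+; the column type below is just illustrative) seems to be giving the column a binary collation once, instead of casting in the query, so the comparisons stay case-sensitive and the indexes still get used:

          # Case-sensitive column, no per-query BINARY cast needed.
          $dbh->do(q{
              ALTER TABLE catalog
                  MODIFY author VARCHAR(100)
                  CHARACTER SET latin1 COLLATE latin1_bin
          });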

      Thanks for all the input, my fellow coders. And sorry for not having mentioned this 'binary' issue, as I thought it might not be the cause. But since you directed me to the one thing that could explain the slowness, and it didn't solve it, I could start suspecting the BINARY casts (apparently the cast keeps MySQL from using the indexes at all).

      Take care

      Andre

Re: mysql's join too slow; using Perl to compare two tables
by badaiaqrandista (Pilgrim) on Feb 05, 2006 at 23:10 UTC
    I think you should try the INSERT ... ON DUPLICATE KEY UPDATE statement. It checks whether the data you want to insert is already there; if it is, it updates the columns instead of creating a new record. With this you don't need to do any SELECT query at all, and it (theoretically) will increase performance because you only do one INSERT query. The only thing you need is to create a unique index for the fields you mention (author, etc.).
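    A hedged sketch of what that could look like - the index name, column sizes and the description prefix length are illustrative (MySQL needs a prefix length to index TEXT columns), and price is left out of the unique key so a price change updates the old row instead of inserting a new one:

        # One-time unique key over the identifying columns.
        $dbh->do(q{
            CREATE UNIQUE INDEX u_atld
                ON catalog (author, title, label, description(100))
        });

        # Insert-or-update in a single statement per record.
        my $upsert = $dbh->prepare(q{
            INSERT INTO catalog (author, title, label, description, price)
            VALUES (?, ?, ?, ?, ?)
            ON DUPLICATE KEY UPDATE price = VALUES(price)
        });
        $upsert->execute(@$_) for @data_from_txt;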
    --------------
    badaiaqrandista
      I hadn't seen this before. Nice little trick.
      Sean
Re: mysql's join too slow; using Perl to compare two tables
by TedPride (Priest) on Feb 05, 2006 at 13:15 UTC
    Do a SELECT for all the existing data. Convert the fields to some standard format (in this case, lowercase tab delimited is probably fine) and create a hash for fields => id, or fields hash => id. Now all you have to do is take the new data, convert it also to the same format, and check to see if each item is in the hash. If it is, UPDATE using the id. If it isn't, do an INSERT. If memory use is a problem, create a dump of the existing data, do the above processing on your own computer rather than the server, and split the new data into two files, one for UPDATE and one for INSERT. The server will then be able to process the result with minimum effort.
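    Something like the following for the key building - just a sketch assuming the usual $dbh; hashing the normalized string with MD5 keeps the lookup hash small if memory is tight:

        use Digest::MD5 qw(md5_hex);

        # Normalize to lowercase tab-delimited, then hash it down.
        sub record_key {
            return md5_hex( lc join "\t", @_ );
        }

        # Map each existing record's key to its product_id.
        my %id_for;
        my $sth = $dbh->prepare(
            'SELECT product_id, author, title, label, description, price FROM catalog');
        $sth->execute;
        while ( my ($id, @fields) = $sth->fetchrow_array ) {
            $id_for{ record_key(@fields) } = $id;
        }
        # Later: exists $id_for{ record_key(@new_fields) } ? UPDATE : INSERT.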
