Fetch Mysql huge database

by ym_chaitu (Initiate)
on Sep 29, 2011 at 05:46 UTC

ym_chaitu has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a huge table containing around 34,748,806 rows, among which there are duplicates, and I need to take each value and check whether it exists in another table. But when I loop over fetchrow_array it takes too much time (more than 18 hours). Can anyone suggest a faster way to do this? Here is the snippet of the code I ran:
$data{RBS_query} = qq~select UserName from table
    where UserName <> '' and UserName not like 'deleted%'
    order by UserName~;
#print "Query--> ".$data{RBS_query}."\n";
$data{cur} = $data{dbh}->prepare($data{RBS_query});
$data{cur}->execute or die($data{dbh}->errstr);
$data{num} = $data{cur}->rows;
print "RBS Rows--> ".$data{num}."\n";
while (my @rows = $data{cur}->fetchrow_array()) {
    #print "Username --> ".$rows[0]."\n";
    if (grep { $_ eq $rows[0] } @RBS_User_Names) {
        # print "Value exist in array\n";
    }
    else {
        push(@RBS_User_Names, $rows[0]);
        #print "value doesnot exist in array\n";
    }
}
I have also tried using DISTINCT in the MySQL query, but it was of no use: the program kept running, yet no result appeared even after a long time. Any help would be greatly appreciated. Thank you. Regards, Chaitanya
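
(For reference: the grep inside the fetch loop rescans @RBS_User_Names for every fetched row, so the loop is quadratic in the number of names. A hash makes each membership test a constant-time lookup. A minimal, untested sketch of the same loop, reusing the handles from the snippet above:)

my %seen;
while (my @rows = $data{cur}->fetchrow_array()) {
    next if $seen{ $rows[0] }++;      # skip names we have already seen
    push(@RBS_User_Names, $rows[0]);  # keep the first occurrence only
}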

Replies are listed 'Best First'.
Re: Fetch Mysql huge database
by BrowserUk (Patriarch) on Sep 29, 2011 at 06:38 UTC

    For bulk processing like this, I'd sidestep DBI and use command-line tools.

    Create a file called getnames.sql containing:

    select UserName from table where UserName <> '' and UserName not like 'deleted%'

    And two Perl scripts. 1) partNames.pl:

    #! perl -w
    use strict;

    # One output filehandle per leading character of the name.
    my %fhs;
    while( <> ) {
        my $char = substr $_, 0, 1;
        if( not exists $fhs{ $char } ) {
            open $fhs{ $char }, '>', "names.$char" or die $!;
        }
        print { $fhs{ $char } } $_;
    }

    And 2) uniqNames.pl:

    #! perl -w
    use strict;

    # Read the partition files one at a time; each %uniq hash only ever
    # holds the names that share one leading character.
    for my $fname ( glob 'names.*' ) {
        my %uniq;
        open my $fh, '<', $fname or die $!;
        while( <$fh> ) {
            print unless exists $uniq{ $_ };
            $uniq{ $_ } = 1;
        }
    }

    Then run the following pair of commands:

    mysql --quick < getnames.sql | perl partNames.pl
    perl uniqNames.pl > uniqedNames.txt

    The first line dumps the names and pipes them to the perl script that then writes them to a set of files (called names.?, where ? is the first character of the name).

    The second script reads those files one at a time and writes the unique names to stdout where they are directed into a file called uniqedNames.txt.

    The whole (untested) process should take less time than sorting the 34 million names -- inside the DB or with your system sort utility. I'd hazard a guess that it would take less than 20 minutes.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi, thank you for taking the time to reply, but the script as given here didn't show the expected results. I adapted your script and tried it on a test database first. It gave one error in partNames.pl, so I changed the line my $c = substr $, 0, 1; to my $c = substr $_, 0, 1;. After that it started working, but it stored all the values from the table in one file called "names", including the column name and the duplicate values. Then, when I ran the uniqNames.pl script, I got no result; the file it created was 0 KB. Can you kindly help me with this? Once again, thank you for your reply, and pardon me if there are any mistakes.
        didn't show the expected results.

        I did warn that the code was untested.

        I changed the line my $c = substr $, 0, 1; to my $c = substr $_, 0, 1;

        Correct correction :)

        but it stored all the values from the table in one file called "names"

        I assigned the first char to a variable $c, but then used $char everywhere else.

        You should have received a message: Name "main::char" used only once: possible typo ...?

        I've updated the post above to correct the errors you've pointed out, but there may be more.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fetch Mysql huge database
by fisher (Priest) on Sep 29, 2011 at 06:22 UTC

    The main idea behind a database is that you do not need to fetch the whole dataset to the client. If you really do need that, which is a rare case, it is a sign of bad system design. In most cases you need only a small subset of the stored data, and your job is to select it properly.

    Second, the MySQL server itself can run too slowly in some circumstances; e.g., the lack of an index on the 'UserName' field can cause excessive query times. How long does it take when you select the above dataset from the mysql client itself?

    Third, you need space in memory to store this dataset locally on your client; have you worked out how much memory you need to store your 34,748,806 rows?
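
    (As a rough back-of-envelope figure: on a 64-bit perl, a short string scalar typically costs on the order of 60-80 bytes once SV overhead and allocator padding are counted, so an array holding all 34,748,806 usernames could easily run to 2-3 GB of RAM before any deduplication structure is added.)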

Re: Fetch Mysql huge database
by Neighbour (Friar) on Sep 29, 2011 at 06:35 UTC
    Using SELECT DISTINCT is exactly what you need here, though. Since that is running too long, the issue lies with the MySQL database itself. Here are some ideas that might help you (a rough DBI sketch for some of these checks follows the list):
    Check if there is an index on the column UserName.
    Check if there is enough RAM in the machine where the database runs. If there isn't, MySQL will create a temporary table on disk, which is generally very slow.
    Check if your query is blocked by other queries that operate on that table using SHOW FULL PROCESSLIST on the database.
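
    A rough, untested DBI sketch for the first and third checks (the connection details and the table name my_table are placeholders):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:database=mydb', 'user', 'password',
                            { RaiseError => 1 } );

    # Check 1: does any index cover the UserName column?
    my $indexes = $dbh->selectall_arrayref( q{SHOW INDEX FROM my_table},
                                            { Slice => {} } );
    print "No index on UserName\n"
        unless grep { $_->{Column_name} eq 'UserName' } @$indexes;

    # Check 3: what else is running against the database right now?
    my $procs = $dbh->selectall_arrayref( q{SHOW FULL PROCESSLIST},
                                          { Slice => {} } );
    printf "%s | %s\n", $_->{Command} || '', $_->{Info} || '' for @$procs;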
Re: Fetch Mysql huge database
by Anonymous Monk on Sep 29, 2011 at 11:49 UTC
    Ummm...

    SELECT BIG.primary_key FROM big_table BIG
    LEFT OUTER JOIN other_table OTH ON (BIG.primary_key = OTH.primary_key)
    WHERE OTH.primary_key IS NULL;


    Now you have the primary-keys of all the rows in BIG that do not have a corresponding row in OTH. You didn't write a single line of Perl code. Not a single row (except the solution) ever left the confines of the SQL server. The database did all the work. If you need to test other combinations of columns, build an index on those column-sets in both tables first.
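
    A hedged sketch of that index creation via DBI, assuming a connected handle $dbh and, purely for illustration, UserName as the column-set to index (the index names are made up):

    $dbh->do( q{CREATE INDEX idx_big_user ON big_table (UserName)} );
    $dbh->do( q{CREATE INDEX idx_oth_user ON other_table (UserName)} );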
Re: Fetch Mysql huge database
by rdfield (Priest) on Sep 29, 2011 at 15:02 UTC
    getting rid of duplicates too:
    select username from table
    where username != ''
      and username not like 'deleted%'
      and username not in (select username from other_table)
    group by username
    having count(*) > 1;
    I've never used mysql, but in other RDBMS systems I would check that enough memory has been assigned to buffer cache, sort area, hash sort area, etc. And by "enough", I mean "lots".
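
    For MySQL in particular, an untested sketch of inspecting the relevant settings, assuming a connected DBI handle $dbh (the variable names are the standard MySQL ones):

    my $vars = $dbh->selectall_arrayref(
        q{SHOW VARIABLES WHERE Variable_name IN
            ('sort_buffer_size', 'tmp_table_size', 'max_heap_table_size')} );
    printf "%s = %s\n", @$_ for @$vars;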

    Joining two 34M-row tables on a decent enough server shouldn't take long. Nested iteration over two large tables, however, does take a long time.

    Having indexes on username in both tables would help a lot, too.

    rdfield

      Indexes would be quite mandatory, to the point that some DBMSes will temporarily create indexes in order to do such a join, or refuse to do the join.

      Even if you do wind up creating indexes only to immediately drop them, or if you set up an index that is “non-maintained” and it must be refreshed now and then, c’est la guerre.   Just a cost of doing business.

      (As of course you surely know... but just for the benefit of any onlookers who maybe don’t...) Left and right so-called “outer” joins are those which cause all of the rows in the table on the left or the right side of the relationship to appear even if they have no matching row on the other side.   Testing for NULL on the right side means that it is a non-matching row.

      One characteristic of the original strategy is that it tries to use “memory” ... and when you have millions of anything, you absolutely cannot use “memory” to do it.   If you don’t literally run out of the stuff (after hours of painful grinding in which your disk-drive might sue for workmen’s comp), you might spend days waiting for a solution, as thrashing sends everything on your computer system to the nether regions.   Whereas the inner-join approach will be virtually instantaneous.   It won’t even take much memory, because all the SQL server’s doing is marching through two compatible index trees.

      Incidentally... “34 million rows” really isn’t that outrageously large.   Oh, but it’s large enough to put a bit of a test on your strategy and to call you out rather quickly if you didn’t pick a good one, though.   So, if you try something and you observe that the system is really getting bolluxed up, it’s time to stop and reconsider how you’re trying to solve this problem.   “There’s more than one way to do it™” and you can be sure that there must be a much-better way.   “Okay, time to sit down and eat a sandwich here... I’m obviously lost in the woods... that is to say, ‘momentarily displaced.’ ”

        Yes. On a large table, indexes are essential for any field (or group of fields) which you're going to use to filter or order your results. Create an index on the UserName field here, and it'll drastically speed things up.

Re: Fetch Mysql huge database
by duyet (Friar) on Sep 29, 2011 at 06:42 UTC

    If you just want to check whether an entry exists in your table, you can simply count it, e.g.:

    select count(*) from table where table.field='some_value'
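
    A minimal usage sketch of that check via DBI, assuming a connected handle $dbh (the table, field, and value are placeholders):

    my ($count) = $dbh->selectrow_array(
        q{select count(*) from my_table where my_table.field = ?},
        undef, 'some_value' );
    print "entry exists ($count rows)\n" if $count;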
