Fetch Mysql huge database

by ym_chaitu (Initiate)
on Sep 29, 2011 at 05:46 UTC

ym_chaitu has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a huge table containing around 34,748,806 rows, among which there are duplicates, and I need to take each value and check whether it exists in another table. But when I loop over fetchrow_array it takes too much time (more than 18 hours). Can anyone suggest a faster way to do this? Here is the snippet of the code I ran:
$data{RBS_query} = qq~select UserName from table
    where UserName <> '' and UserName not like 'deleted%'
    order by UserName~;
#print "Query--> ".$data{RBS_query}."\n";
$data{cur} = $data{dbh}->prepare($data{RBS_query});
$data{cur}->execute or die($data{dbh}->errstr);
$data{num} = $data{cur}->rows;
print "RBS Rows--> ".$data{num}."\n";
while (my @rows = $data{cur}->fetchrow_array()) {
    #print "Username --> ".$rows[0]."\n";
    if (grep { $_ eq $rows[0] } @RBS_User_Names) {
        # print "Value exist in array\n";
    }
    else {
        push(@RBS_User_Names, $rows[0]);
        #print "value doesnot exist in array\n";
    }
}
I have also tried using DISTINCT in the MySQL query, but it was of no use: the program kept running, yet no result appeared even after a long time. Any help would be greatly appreciated. Thank you. Regards, Chaitanya
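
(For reference: the grep inside the fetch loop rescans @RBS_User_Names for every fetched row, so the loop is quadratic in the number of names. A hash makes each membership test a constant-time lookup. A minimal, untested sketch of the same loop, reusing the handles from the snippet above:)

my %seen;
while (my @rows = $data{cur}->fetchrow_array()) {
    next if $seen{ $rows[0] }++;      # skip names we have already seen
    push(@RBS_User_Names, $rows[0]);  # keep the first occurrence only
}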

Replies are listed 'Best First'.
Re: Fetch Mysql huge database
by BrowserUk (Patriarch) on Sep 29, 2011 at 06:38 UTC

    For bulk processing like this, I'd sidestep DBI and use command-line tools.

    Create a file called getnames.sql containing:

    select UserName from table where UserName <> '' and UserName not like 'deleted%'

    And two Perl scripts. 1) partNames.pl:

    #! perl -w
    use strict;

    # One output filehandle per leading character of the name.
    my %fhs;
    while( <> ) {
        my $char = substr $_, 0, 1;
        if( not exists $fhs{ $char } ) {
            open $fhs{ $char }, '>', "names.$char" or die $!;
        }
        print { $fhs{ $char } } $_;
    }

    And 2) uniqNames.pl:

    #! perl -w
    use strict;

    # Read the partition files one at a time; each %uniq hash only ever
    # holds the names that share one leading character.
    for my $fname ( glob 'names.*' ) {
        my %uniq;
        open my $fh, '<', $fname or die $!;
        while( <$fh> ) {
            print unless exists $uniq{ $_ };
            $uniq{ $_ } = 1;
        }
    }

    Then run the following pair of commands:

    mysql --quick < getnames.sql | perl partNames.pl
    perl uniqNames.pl > uniqedNames.txt

    The first line dumps the names and pipes them to the perl script that then writes them to a set of files (called names.?, where ? is the first character of the name).

    The second script reads those files one at a time and writes the unique names to stdout where they are directed into a file called uniqedNames.txt.

    The whole (untested) process should take less time than sorting the 34 million names -- inside the DB or with your system sort utility. I'd hazard a guess that it would take less than 20 minutes.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi, thank you for taking the time to reply, but the script as given here didn't show the expected results. I adapted your script and tried it on a test database first. It gave one error in partNames.pl, so I changed the line my $c = substr $, 0, 1; to my $c = substr $_, 0, 1;. After that it started working, but it stored all the values from the table in one file called "names", including the column name and the duplicate values. Then, when I ran the uniqNames.pl script, I got no result; the file it created was 0 KB. Can you kindly help me with this? Once again, thank you for your reply, and pardon me if there are any mistakes.
        didn't show the expected results.

        I did warn that the code was untested.

        I changed the line my $c = substr $, 0, 1; to my $c = substr $_, 0, 1;

        Correct correction :)

        but it stored all the values from the table in one file called "names"

        I assigned the first char to a variable $c, but then used $char everywhere else.

        You should have received a message: Name "main::char" used only once: possible typo ...?

        I've updated the post above to correct the errors you've pointed out, but there may be more.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fetch Mysql huge database
by fisher (Priest) on Sep 29, 2011 at 06:22 UTC

    The main idea behind a database is that you do not need to fetch the whole dataset to the client. If you really do need that, which is a rare case, it is a sign of bad system design. In most cases you need only a small subset of the stored data, and your job is to select it properly.

    Second, the MySQL server itself can run too slowly in some circumstances; e.g., the lack of an index on the 'UserName' field can cause excessive query times. How long does it take when you select the above dataset from the mysql client itself?

    Third, you need space in memory to store this dataset locally on your client; have you worked out how much memory you need to store your 34,748,806 rows?
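
    (As a rough back-of-envelope figure: on a 64-bit perl, a short string scalar typically costs on the order of 60-80 bytes once SV overhead and allocator padding are counted, so an array holding all 34,748,806 usernames could easily run to 2-3 GB of RAM before any deduplication structure is added.)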

Re: Fetch Mysql huge database
by Neighbour (Friar) on Sep 29, 2011 at 06:35 UTC
    Using SELECT DISTINCT is exactly what you need here, though. Since that is running too long, the issue lies with the MySQL database itself. Here are some ideas that might help you (a rough DBI sketch for some of these checks follows the list):
    Check if there is an index on the column UserName.
    Check if there is enough RAM in the machine where the database runs. If there isn't, MySQL will create a temporary table on disk, which is generally very slow.
    Check if your query is blocked by other queries that operate on that table using SHOW FULL PROCESSLIST on the database.
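
    A rough, untested DBI sketch for the first and third checks (the connection details and the table name my_table are placeholders):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:mysql:database=mydb', 'user', 'password',
                            { RaiseError => 1 } );

    # Check 1: does any index cover the UserName column?
    my $indexes = $dbh->selectall_arrayref( q{SHOW INDEX FROM my_table},
                                            { Slice => {} } );
    print "No index on UserName\n"
        unless grep { $_->{Column_name} eq 'UserName' } @$indexes;

    # Check 3: what else is running against the database right now?
    my $procs = $dbh->selectall_arrayref( q{SHOW FULL PROCESSLIST},
                                          { Slice => {} } );
    printf "%s | %s\n", $_->{Command} || '', $_->{Info} || '' for @$procs;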
Re: Fetch Mysql huge database
by Anonymous Monk on Sep 29, 2011 at 11:49 UTC
    Ummm...

    SELECT BIG.primary_key FROM big_table BIG
    LEFT OUTER JOIN other_table OTH ON (BIG.primary_key = OTH.primary_key)
    WHERE OTH.primary_key IS NULL;


    Now you have the primary-keys of all the rows in BIG that do not have a corresponding row in OTH. You didn't write a single line of Perl code. Not a single row (except the solution) ever left the confines of the SQL server. The database did all the work. If you need to test other combinations of columns, build an index on those column-sets in both tables first.
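
    A hedged sketch of that index creation via DBI, assuming a connected handle $dbh and, purely for illustration, UserName as the column-set to index (the index names are made up):

    $dbh->do( q{CREATE INDEX idx_big_user ON big_table (UserName)} );
    $dbh->do( q{CREATE INDEX idx_oth_user ON other_table (UserName)} );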
Re: Fetch Mysql huge database
by rdfield (Priest) on Sep 29, 2011 at 15:02 UTC
    getting rid of duplicates too:
    select username from table
    where username != ''
      and username not like 'deleted%'
      and username not in (select username from other_table)
    group by username
    having count(*) > 1;
    I've never used mysql, but in other RDBMS systems I would check that enough memory has been assigned to buffer cache, sort area, hash sort area, etc. And by "enough", I mean "lots".
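
    For MySQL in particular, an untested sketch of inspecting the relevant settings, assuming a connected DBI handle $dbh (the variable names are the standard MySQL ones):

    my $vars = $dbh->selectall_arrayref(
        q{SHOW VARIABLES WHERE Variable_name IN
            ('sort_buffer_size', 'tmp_table_size', 'max_heap_table_size')} );
    printf "%s = %s\n", @$_ for @$vars;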

    Joining two 34M-row tables on a decent enough server shouldn't take long. Nested iteration over two large tables, however, does take a long time.

    Having indexes on username in both tables would help a lot, too.

    rdfield

      Indexes would be quite mandatory, to the point that some DBMSes will temporarily create indexes in order to do such a join, or refuse to do the join.

      Even if you do wind up creating indexes only to immediately drop them, or if you set up an index that is “non-maintained” and it must be refreshed now and then, c’est la guerre.   Just a cost of doing business.

      (As of course you surely know... but just for the benefit of any onlookers who maybe don’t...) Left and right so-called “outer” joins are those which cause all of the rows in the table on the left or the right side of the relationship to appear even if they have no matching row on the other side.   Testing for NULL on the right side means that it is a non-matching row.

      One characteristic of the original strategy is that it tries to use “memory” ... and when you have millions of anything, you absolutely cannot use “memory” to do it.   If you don’t literally run out of the stuff (after hours of painful grinding in which your disk-drive might sue for workmen’s comp), you might spend days waiting for a solution, as thrashing sends everything on your computer system to the nether regions.   Whereas the inner-join approach will be virtually instantaneous.   It won’t even take much memory, because all the SQL server’s doing is marching through two compatible index trees.

      Incidentally... “34 million rows” really isn’t that outrageously large.   Oh, but it’s large enough to put a bit of a test on your strategy and to call you out rather quickly if you didn’t pick a good one, though.   So, if you try something and you observe that the system is really getting bolluxed up, it’s time to stop and reconsider how you’re trying to solve this problem.   “There’s more than one way to do it™” and you can be sure that there must be a much-better way.   “Okay, time to sit down and eat a sandwich here... I’m obviously lost in the woods... that is to say, ‘momentarily displaced.’ ”

        Yes. On a large table, indexes are essential for any field (or group of fields) which you're going to use to filter or order your results. Create an index on the UserName field here, and it'll drastically speed things up.

Re: Fetch Mysql huge database
by duyet (Friar) on Sep 29, 2011 at 06:42 UTC

    If you just want to check whether an entry exists in your table, you can simply count it, e.g.:

    select count(*) from table where table.field='some_value'
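
    A minimal usage sketch of that check via DBI, assuming a connected handle $dbh (the table, field, and value are placeholders):

    my ($count) = $dbh->selectrow_array(
        q{select count(*) from my_table where my_table.field = ?},
        undef, 'some_value' );
    print "entry exists ($count rows)\n" if $count;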
