A database table size issue

by sophate (Beadle)
on Apr 27, 2012 at 08:20 UTC

sophate has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I have a huge database table, over 30GB. I would like to create a Perl script that reads the table row by row and processes the data after reading each row.

I've written the script below, but I run into a capacity issue right away: the script eats up all the memory when it tries to store the whole table in an array. Here is the code.

#!/usr/bin/perl
use DBI;

$ENV{'FREETDSCONF'} = '/home/test/config/freetds.conf';
$ENV{'SYBASE'}      = '/home/test/sql_server_15_0-64bit';

my $SERVER = 'TEST';
my $DB     = 'test';
my $USER   = 'admin';
my $PASSWD = 'admin';

$DB_Ref = DBI->connect("DBI:Sybase:server=$SERVER;database=$DB",
                       "$USER", "$PASSWD",
                       {syb_err_handler => \&syb_err_h})
    or die "Couldn't connect to $DB : " . DBI->errstr;

$DB_Ref->{RaiseError} = 1;
$DB_Ref->{PrintError} = 1;
$DB_Ref->do("set dateformat ymd");

## The table testTable is over 30GB
## The line below eats up all RAM on the server
my $testArray = $DB_Ref->selectall_arrayref('select * from testTable');

## Do something on each row here

Can I get the results of the query one row at a time, something like an input stream, so that I can process them row by row with a construct like "while (<QUERY_STREAM>) {}" and save memory? Many thanks.
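A minimal sketch of the row-at-a-time idiom being asked about, using DBI's prepare/execute/fetchrow_arrayref with the same handle and table as the script above (the per-row processing is assumed to go inside the loop):

## Stream the rows instead of loading the whole table into memory.
my $sth = $DB_Ref->prepare('select * from testTable');
$sth->execute();

while (my $row = $sth->fetchrow_arrayref()) {
    ## $row is an array reference holding the columns of one row.
    ## Do something with @$row here.
}

$sth->finish();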

Replies are listed 'Best First'.
Re: A database table size issue
by roboticus (Chancellor) on Apr 27, 2012 at 09:57 UTC

    sophate:

    You've already gotten a good answer, so I'll just digress a bit.

    I don't often see a task that processes *every* row in a table where using SQL isn't the better choice. It's pretty expensive to serialize each row, ship it to another machine that does some processing, and then repeat the process to get the data back into the database. Since you haven't mentioned your level of experience with databases, I thought I'd mention a couple of things, just in case.

    1. Can your task be done with SQL? It's a pretty simple language, and since the database server is optimized to use it, it's frequently faster than anything you'll be able to do locally (especially as you already have the overhead of shipping the data across a pipe).
    2. Even if your modifications are too complicated for a simple update, it can be better to make a temporary table and then do a set of transformations in SQL before updating the original table.
    3. If you find that you really need to get the data to the local desktop, be sure to get only the data you need to work on (via the WHERE clause, appropriate joins, etc.) so you can minimize the impact to your application, and all other users of the database.
    4. If you need to process all the rows and it's going to be a frequent operation, you might consider using the bulk-loader capabilities of your database to export the table into a flat file, which you can process easily with perl, and then bulk load the results back into the database. The bulk-loading tools for the databases I use (Sybase, MSSQL server, Oracle) are pretty efficient for whacking large amounts of data (and 30GB certainly qualifies).

    If you want, feel free to share the types of transformations you're wanting and I'll try to help with the SQL.
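    For example, if the per-row change can be expressed as a set operation, a single statement sent through DBI keeps all the work on the server. This is only a sketch; some_column and the transformation are placeholders, since we don't know the actual task:

    ## Hypothetical set-based version: one UPDATE instead of fetching 30GB
    ## to the client and writing it back row by row.
    $DB_Ref->do(q{
        update testTable
        set    some_column = upper(some_column)
        where  some_column is not null
    });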

    Update: Typo fix (s/to/too/).

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      ++ for the good answer, but just to add to it: use set-based logic rather than procedural row-by-row processing; use the server's resources rather than the client's. Furthermore, even opening just a firehose cursor to fetch that many rows will put a lot of stress on the connection if the fetching isn't optimised for batches.
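      A rough sketch of batched fetching with DBI, using fetchall_arrayref's $max_rows argument to pull the rows in chunks (the 10_000 batch size is an arbitrary choice):

      ## Fetch in batches of 10_000 rows instead of one row per call.
      my $sth = $DB_Ref->prepare('select * from testTable');
      $sth->execute();

      while (my $batch = $sth->fetchall_arrayref(undef, 10_000)) {
          last unless @$batch;            ## no rows left
          for my $row (@$batch) {
              ## process one row (an array reference) here
          }
          last unless $sth->{Active};     ## result set exhausted mid-batch
      }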
        Use the server's resources rather than the client's.
        Really? Why? Typically, the number of servers is far smaller than the number of clients. Using server resources to benefit one client hurts all the clients.

        I'm not saying that in this case the task should be done on the client (we've no idea about the task, nor about the setup), but I disagree with the blanket "use server resources rather than the client's". In far too many cases that doesn't scale if done without thinking.

      Thanks for the suggestion. The script was really slow when I processed the table row by row. I took your advice and rewrote the SQL to let the DB handle part of the processing. The script now runs much faster than before :-)

Re: A database table size issue
by Anonymous Monk on Apr 27, 2012 at 08:24 UTC

      It's exactly what I'm looking for!! Thanks so much!!

Re: A database table size issue
by sundialsvc4 (Abbot) on Apr 27, 2012 at 12:31 UTC

    Echoing (and upvoting) Roboticus' suggestions, I have found that there can simply be tremendous improvements to be had when the TCP/IP communication link (no matter how fast it is) is eliminated from the processing picture. If you’ve got a lot of data to be processed, first get it in table-form (or, if necessary, flat-file form) to that server or to another server that might be connected to it, say, by an optical fiber link, e.g. a SAN or what-have-you. Then, do the processing as “locally” as you possibly can, provided of course that you do not overload a server that has been dedicated to and configured for a particular task, with a task for which it is not intended.

    Then, construct your processing in a way that features these considerations, for instance:

    Note that in the following, as originally writ, I assumed the update file was 30GB. As far as database tables go, 30GB is merely “moderate size.” The basic principles pontificated :-) here still hold.
    • It is done mostly with SQL, which is not an arduous task. Use EXPLAIN on the queries you are contemplating, to see exactly how the system will do it. Explore the process you have in mind; map it out on paper before you build it. Avoid “build it just to see if it works,” because then you’re always scrambling, and losing credibility quickly. The best way to stay on top of the ball is to studiously avoid falling off.
    • Bulk operations are done as much as possible with tools and/or database facilities that are designed by the vendor for the purpose.
    • Pay attention to transactions: use ’em. Transactions are a clear indication to the server of how much data it can keep “in memory” for a few milliseconds, and how much must be immediately committed to disk. Disk-I/O delays add up very quickly. But don’t let your transactions become too big. COMMIT regularly throughout, and of course once more at the end; if an operation fails, ROLLBACK (a sketch of batched commits follows this list). It is desirable to use “highly exclusive of others” transaction isolation levels, but this might need to be coordinated with other simultaneous activities. (You also must consider whether other users/applications should be permitted to see incompletely-loaded or incompletely-modified data.)
    • What IBM used to call “checkpoint/restart.” If the process of churning through 30GB of data fails at any point, it should be possible to resume fairly automatically. Likewise, if it is discovered that the data which was applied was incorrect, a strategy must exist for getting the database back to a previous known-good state. Also consider how you intend to identify, catalog, keep, and then eventually discard those 30GB data files.
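    A rough sketch of the batched-commit idea with DBI (the UPDATE statement, the 10_000 batch size, and the get_next_change() helper are all placeholders for whatever the real workload looks like):

    ## Commit in batches so no single transaction grows too large.
    $DB_Ref->{AutoCommit} = 0;    ## manage transactions explicitly
    my $count = 0;

    eval {
        my $sth = $DB_Ref->prepare('update testTable set some_column = ? where id = ?');
        while (my ($id, $new_value) = get_next_change()) {   ## hypothetical work source
            $sth->execute($new_value, $id);
            $DB_Ref->commit() if ++$count % 10_000 == 0;     ## checkpoint every 10_000 rows
        }
        $DB_Ref->commit();        ## final commit
    };
    if ($@) {
        warn "Update failed, rolling back: $@";
        $DB_Ref->rollback();
    }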

      Thanks for your detailed and in-depth reply. Really appreciate the help from you and other monks.
