http://qs321.pair.com?node_id=11127497

cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Greetings. I have about 250K records in a postgres db I need to process. I'm wondering if there's a way to read this in batches of 1000 records or something, to cut down on db transaction overhead. The DBI docs mention methods execute_array and execute_for_fetch that do something in batches, but I don't completely follow what they're talking about; these methods have something to do with tuples. I just want to run a query, get a batch of records, process those, get another batch, and repeat. Super Search on "postgres batch read" returns nothing. Any pointers appreciated.

Replies are listed 'Best First'.
Re: Postgres batch read with DBI?
by choroba (Cardinal) on Jan 26, 2021 at 22:48 UTC
    See the LIMIT and OFFSET clauses in the documentation.
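    For instance (a minimal sketch, with the table and column names borrowed from the original poster's query, and an ORDER BY added so the batches don't overlap between executions), each batch can be requested like this:

        SELECT id, post
          FROM blog
         WHERE language = 'en'
         ORDER BY id
         LIMIT 1000 OFFSET 0;   -- then OFFSET 1000, 2000, ... for later batches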
      Many thanks for the pointer. Running this (SqlSupport is a homebrew helper module):
      use lib qw ( c:/perlmodules );
      use SqlSupport;
      my $dbh = connectpgdb(*,*,*,*,*);
      my $query = "select id,post from blog where language = 'en'";
      my $sth = $dbh->prepare($query);
      $sth->execute || die "Could not execute MySQL statement: $sqlstatement";
      my $rows = []; # cache for batches of rows
      while( my $row = ( shift(@$rows) || shift(@{$rows=$sth->fetchall_arrayref(undef,10_000) || []}))) {
          my $foo = 1;
      }
      $sth->finish();
      It's saying the last bracket is unmatched. But my IDE shows every bracket has a mate, and I clipped the whole while loop from the DBI docs.
        This works for me:
        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };
        use DBI;

        my ($dbname, $user, $password) = @ARGV;
        my $db = 'DBI'->connect('dbi:Pg:dbname=' . $dbname, $user, $password,
                                {AutoCommit => 0});
        $db->do('CREATE TABLE t (id INT, name TEXT)');

        my $populate = $db->prepare('INSERT INTO t (id, name) VALUES (?,?)');
        my $max = 250_000;
        for my $i (0 .. $max) {
            $populate->execute($i, join "", map chr, map 64 + int rand 26, 1 .. 10);
            print "$i\r";
        }

        my $from = 0;
        my $fetch = $db->prepare('SELECT * FROM t LIMIT ? OFFSET ?');
        while ($from <= $max) {
            $fetch->execute(1000, $from);
            while (my @row = $fetch->fetchrow_array) {
                say join "\t", @row;
            }
            say '---';
            $from += 1000;
        }
        $db->disconnect;
Re: Postgres batch read with DBI?
by roboticus (Chancellor) on Jan 27, 2021 at 18:57 UTC

    cormanaz:

    OK, I'll be that guy today....

    I don't know what sort of massive updates you're doing, but have you thought about doing the work inside the database itself? Databases are pretty good at many data-munging tasks, and it may be faster to tell the database what to do and how to do it, rather than slurping all the data down to your machine, making the changes, and then pushing it all back up.

    You didn't mention what sort of processing you're doing, and what you want to do may well be the right approach. But most of the time I've seen people wanting to process a huge mess of records, it would've been faster and easier to just tell the database to do it.
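    To make that concrete with a deliberately simplified, hypothetical example (table and column names are taken from the original post, and lower-casing is just a stand-in for whatever the real processing is): if the work can be expressed in SQL, a single in-database statement handles all 250K rows in one round trip, with no data pulled down to the client at all.

        UPDATE blog
           SET post = lower(post)
         WHERE language = 'en';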

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      To add a bit to this, PostgreSQL even has support for embedding a Perl interpreter into the database engine, so it is definitely possible to write the whole update as a stored procedure and run it inside the RDBMS. I think that uses a very different interface from DBI, so it might not be worth the effort.
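      A minimal sketch of what such a stored function could look like, assuming the plperl extension is installed and reusing the table/column names from the original post (the real processing is left as a comment):

          -- Hypothetical PL/Perl function: the body is ordinary Perl running inside
          -- the server, talking to the database via spi_exec_query instead of DBI.
          CREATE OR REPLACE FUNCTION process_posts() RETURNS integer AS $$
              my $rv    = spi_exec_query(q{SELECT id, post FROM blog WHERE language = 'en'});
              my $count = 0;
              for my $row (@{ $rv->{rows} }) {
                  # ... process $row->{post} here, e.g. by issuing an UPDATE
                  # through another spi_exec_query call ...
                  $count++;
              }
              return $count;
          $$ LANGUAGE plperl;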

Re: Postgres batch read with DBI? - ( PostgreSQL )
by erix (Prior) on Jan 27, 2021 at 07:23 UTC

    cut down on db transaction overhead.

    Depending on what you're up to, you can also set AutoCommit off, then count changed records while you process, and commit every N changes (every 1000 or so). It may be simpler.
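    A minimal sketch of that approach, reusing the table and column names from the original post and using an uppercase conversion as a stand-in for the real processing:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use DBI;

        my ($dbname, $user, $password) = @ARGV;
        my $dbh = DBI->connect("dbi:Pg:dbname=$dbname", $user, $password,
                               { AutoCommit => 0, RaiseError => 1 });

        my $sth = $dbh->prepare("SELECT id, post FROM blog WHERE language = 'en'");
        my $upd = $dbh->prepare('UPDATE blog SET post = ? WHERE id = ?');
        $sth->execute;

        # DBD::Pg normally pulls the whole result set to the client at execute
        # time, so committing inside the loop does not disturb the fetch.
        my $changed = 0;
        while (my ($id, $post) = $sth->fetchrow_array) {
            $upd->execute(uc $post, $id);           # stand-in for the real processing
            $dbh->commit unless ++$changed % 1000;  # commit every 1000 changes
        }
        $dbh->commit;                               # commit the final partial batch
        $dbh->disconnect;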

      With my DBA hat on, I encourage running the program as ONE transaction with auto-commit off. That way, if there is an error after 249k rows, Postgres will roll back the entire transaction and your data will be unchanged and unharmed. Other clients will also not see the in-progress transaction, so they get a consistent view of the data.

      Break up the transaction only if testing and benchmarks demonstrate a real problem. “Premature optimization is the root of all evil.”
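      A minimal sketch of the single-transaction approach (connection parameters come from the command line; do_all_the_processing() is a hypothetical stand-in for the real work):

          #!/usr/bin/perl
          use warnings;
          use strict;
          use DBI;

          my ($dbname, $user, $password) = @ARGV;
          my $dbh = DBI->connect("dbi:Pg:dbname=$dbname", $user, $password,
                                 { AutoCommit => 0, RaiseError => 1 });

          eval {
              do_all_the_processing($dbh);   # all 250K updates happen here
              $dbh->commit;                  # everything becomes visible at once
              1;
          } or do {
              my $err = $@ || 'unknown error';
              eval { $dbh->rollback };       # the data is left exactly as it was
              die "Processing failed, rolled back: $err";
          };
          $dbh->disconnect;

          sub do_all_the_processing {
              my ($dbh) = @_;
              # ... the real per-row or set-based work goes here ...
          }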

        I agree with you.

        The OP wrote: "Greetings. I have about 250K records in a postgres db I need to process. I'm wondering if there's a way to read this in batches of 1000 records or something, to cut down on db transaction overhead."
        The way to "cut down on transaction overhead" is simply to run fewer write transactions - that doesn't have much to do with how often you read the data.

        The DB can perform many, many read operations per second.
        It can perform far fewer write transactions per second than read operations.

        The DB, whichever one you have, will try to maintain a self-consistent state on the hard drive. See the Wikipedia article on the ACID properties.

        There is a lot of overhead involved in a "transaction". The DB has to write to the hard disk multiple times. A commit of a million-row update doesn't necessarily take all that much longer than a commit of 100 rows: the commit operation, in and of itself, is "very expensive".

        The Perl DBI provides:

        $rc = $dbh->begin_work or die $dbh->errstr;

        $rc = $dbh->commit or die $dbh->errstr;
        Wrapping those two statements around any DB operations involving many "writes" will have a huge effect on performance.
        There is no problem whatsoever with a one-million-row "commit".
        250 commits for 250K records will slow things down considerably - perhaps even by an order of magnitude!
Re: Postgres batch read with DBI?
by Corion (Patriarch) on Jan 27, 2021 at 07:39 UTC

    There is also Data::Stream::Bulk, which fetches the results in chunks. It has the advantage that you don't need to rewrite your SQL, but you still need a loop that fetches and processes the results. Also, since you mention transactions, that module will not help with transactions growing too large: it keeps the transaction open and only fetches a slice of the results at a time for processing.

    Personally, when doing long-running SQL stuff, I prefer to have it stateless in the SQL and limit the amount of data returned by the SQL like this:

    select top 1000
        foo
      , bar
      , baz
    from my_table
    where baz > ?
    order by foo, bar

    For the sort criteria, I like to order by some timestamp, so that I update either the oldest or the newest rows first.

    Depending on your flavour of SQL, TOP 1000 needs to be replaced by a LIMIT 1000 clause at the end of the query (as in PostgreSQL).
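    A minimal Perl sketch of this stateless approach, reusing the hypothetical names from the query above and assuming that processing a row changes it so that it no longer matches the WHERE condition:

        #!/usr/bin/perl
        use warnings;
        use strict;
        use DBI;

        my ($dbname, $user, $password, $threshold) = @ARGV;
        my $dbh = DBI->connect("dbi:Pg:dbname=$dbname", $user, $password,
                               { RaiseError => 1 });

        # PostgreSQL syntax: LIMIT at the end of the query instead of TOP 1000
        my $fetch = $dbh->prepare(
            'SELECT foo, bar, baz FROM my_table WHERE baz > ? ORDER BY foo, bar LIMIT 1000'
        );

        while (1) {
            $fetch->execute($threshold);
            my $rows = $fetch->fetchall_arrayref;
            last unless @$rows;         # nothing left to process
            for my $row (@$rows) {
                # ... process the row; the processing is assumed to update it so
                # that it no longer satisfies "baz > ?" and is therefore not
                # selected again on the next pass ...
            }
        }
        $dbh->disconnect;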

Re: Postgres batch read with DBI?
by Anonymous Monk on Jan 27, 2021 at 14:46 UTC

    Perhaps something like:

    # DBI's row-cache idiom: refill the cache from fetchall_arrayref whenever it
    # runs dry; the "|| []" guards against undef once all rows have been fetched
    while ($row = shift(@$rowcache)
               || shift(@{ $rowcache = $sth->fetchall_arrayref(undef, $max_rows) || [] })) {
        ...
    }