http://qs321.pair.com?node_id=150255

BLOBS - outlining the problem

The DBI documentation explains that the DBI can handle BLOB fields up to the maximum size allowed by the system. Usually, inserting or fetching a JPEG file a few KB long is not a problem. If the table field has been defined big enough to receive the intended data, the operation should succeed. However, dealing with files more than 1 megabyte long becomes rather tricky.

Let's have a look at the basics.

BLOB stands for Binary Large Object. It is a field that can store a large amount of data. Its size depends on the implementation. MySQL defines 4 types of BLOB.
type         max size
----------   -------------
TINYBLOB               255
BLOB                65_535
MEDIUMBLOB      16_777_215
LONGBLOB     4_294_967_295
Looking at the above table, we might think that we should have no problems at all. Provided that we have enough room on our database server's disks, we can store up to 4 gigabytes in each field.
Unfortunately, things are not that simple. The above sizes for MEDIUMBLOB and LONGBLOB are theoretical. The real maximum depends on the largest file your filesystem allows (often 2 GB), since it is highly unlikely that you can load a 4 GB file into the database if your OS only gives you 2.
But even if we overcome this first obstacle, a more challenging restriction is waiting. The database engine has a max_allowed_packet limit, which prevents the client from dealing with packets larger than that threshold.
When we look at the MySQL documentation, we learn that, due to the limits imposed by the client/server protocol [1], we can't exceed 16 MB per packet. It means that an INSERT query, or the contents of a fetchrow call, can't be bigger than that.
Moreover, the default value for this limit is 1 MB. You can change it, if you are the database administrator, or else you must live with it.
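
Checking how much your server will actually accept takes only a few lines of DBI. Here is a minimal sketch; the database name, host and credentials are placeholders to adapt to your own setup:

    #!/usr/bin/perl -w
    # a quick check of the server's max_allowed_packet value
    use strict;
    use DBI;

    # placeholders: point this at your own database and credentials
    my $dbh = DBI->connect("DBI:mysql:test;host=localhost",
                           undef, undef, {RaiseError => 1});

    my ($name, $value) = $dbh->selectrow_array(
        qq{SHOW VARIABLES LIKE "max_allowed_packet"});
    print "$name = $value bytes\n";

    $dbh->disconnect();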

How?

The DBI docs say that there is no current mechanism in place to upload or download BLOB fields in chunks.
MySQL's LOAD_FILE(filename) function could be seen as a solution. Unfortunately, it only works if the file is on the server, leading to all sorts of headaches related to user maintenance, security and so on. Moreover, the max_allowed_packet limit is still in place, so we don't really gain anything by uploading a file to the server and then calling LOAD_FILE.
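
Just to make the point concrete, the server-side route would boil down to something like this sketch (my_blobs is a placeholder table, not part of this article's schema), and it fails in exactly the same way once the file outgrows max_allowed_packet:

    # a sketch only: LOAD_FILE() reads the file on the *server* host and
    # returns NULL if the file is missing, unreadable, or larger than
    # max_allowed_packet -- so the bottleneck is still there
    $dbh->do(qq{
        INSERT INTO my_blobs (name, bin)
        VALUES ('perl', LOAD_FILE('/tmp/perl_stable.tar.gz'))
    });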

[1] The 16 MB limit has been increased to 2 GB in MySQL 4.0.1, which is still in alpha, however. See the docs.

Perl to the rescue

We love Perl, we are programmers, so we shouldn't mind a programmatic solution to this hard limit.
There are applications where we need to store large amounts of data, far larger than the limits set by the protocol or, even worse, by an unreachable database administrator.
Here is how I have done it.
        UPLOAD FLOW                          DOWNLOAD FLOW

+-------------------------+          +-------------------------+
|                         |          |    DB field < 2 MB      |
|   large binary file     |          +-------------------------+
|       (5.5 MB)          |          |    DB field < 2 MB      |
|                         |          +-------------------------+
+-------------------------+          |    DB field < 2 MB      |
            |                        +-------------------------+
            V                                    |
+-------------------------+                      V
|   file chunk < 2 MB     |          /=========================\
+-------------------------+          ||    protocol limit     ||
|   file chunk < 2 MB     |          ||         2 MB          ||
+-------------------------+          \=========================/
|   file chunk < 2 MB     |                      |
+-------------------------+                      V
            |                        +-------------------------+
            V                        |   file chunk < 2 MB     |
/=========================\          +-------------------------+
||    protocol limit     ||          |   file chunk < 2 MB     |
||         2 MB          ||          +-------------------------+
\=========================/          |   file chunk < 2 MB     |
            |                        +-------------------------+
            V                                    |
+-------------------------+                      V
|    DB field < 2 MB      |          +-------------------------+
+-------------------------+          |                         |
|    DB field < 2 MB      |          |   large binary file     |
+-------------------------+          |       (5.5 MB)          |
|    DB field < 2 MB      |          |                         |
+-------------------------+          +-------------------------+
How do we achieve our goal, then? By splitting the file into pieces smaller than the protocol bottleneck, sending them to the database properly indexed so that we can easily retrieve them, and then fetching the pieces on demand to rebuild the original file.
The flow of our program is simple. Find the size of the bottleneck, read the file in slices smaller than the limit, and send them to the database table one by one. Each field in the table is identified by a name and a sequential number.
The download phase is the reverse. Select the records that compose our file, sorting them by the sequential number. Then rebuild the file by adding each piece.
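In Perl terms, the whole trick boils down to two loops such as the following stripped-down sketch (error handling, the connection and most columns are left out; $maxlen stands for the chunk size derived from max_allowed_packet; the full script below is the real thing):

    # upload: read the file in chunks smaller than the limit,
    # storing one row per chunk
    open IN, "< $file" or die "can't open $file\n";
    my $ins = $dbh->prepare(qq{
        INSERT INTO software_repos (name, bin, filename) VALUES (?, ?, ?)});
    my $chunk;
    while (read IN, $chunk, $maxlen) {
        $ins->execute($name, $chunk, $file);
    }
    close IN;

    # download: fetch the chunks in insertion order and append them
    my $sel = $dbh->prepare(qq{
        SELECT bin FROM software_repos WHERE name = ? ORDER BY id});
    $sel->execute($name);
    open OUT, "> $outfile" or die "can't open $outfile\n";
    while (my ($bin) = $sel->fetchrow_array) {
        print OUT $bin;        # rebuild the original file piece by piece
    }
    close OUT;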
The responsibility is equally divided between the database engine and the Perl script.
The server will record the chunks, carefully labelled so that they can be retrieved. The script must send the pieces in the right order and use them in the same order during the retrieval phase.
The binary packages are stored in a database table with this structure:
+-------------+---------------+------+-----+---------+-----------+
| Field       | Type          | Null | Key | Default | Extra     |
+-------------+---------------+------+-----+---------+-----------+
| id          | int(11)       |      | PRI | NULL    | auto_incr |
| name        | varchar(50)   |      | MUL |         |           |
| description | varchar(250)  | YES  |     | NULL    |           |
| vers        | varchar(15)   | YES  |     | NULL    |           |
| bin         | mediumblob    | YES  |     | NULL    |           |
| filename    | varchar(50)   |      |     |         |           |
| username    | varchar(30)   |      |     |         |           |
| updated     | timestamp(14) | YES  |     | NULL    |           |
+-------------+---------------+------+-----+---------+-----------+
id is a sequential number, automatically generated for each insertion. By sorting on this number when we retrieve the records, we are sure to get them back in the same order in which we uploaded them.
name is the unique identifier of the package. It is repeated for each record containing a chunk of that package.
bin contains the data.
filename is useful to reconstruct the original file.
The rest is optional.

A software repository

The sample application is something that can actually be useful. It's a software repository, where large binary files are stored in pieces for future use. The interface is not the most user-friendly, but the purpose is to have a look at the innards. I have embedded my comments within the code, in Perl style, so that it should be easy to follow the program flow.
#!/usr/bin/perl -w
use strict;
use DBI;

=head1 NAME

blobs.pl -- script to upload / download HUGE BLOB fields to a MySQL database

=head1 SYNOPSIS

For the purpose of this tutorial, this script will create a
B<software repository>, where you can upload binary packages,
list their status and download them to a file.

    $ perl blobs.pl u perl perl_stable.tar.gz "5.6.1" "my latest version"

Uploads the perl binary package (> 5 MB) to a database table,
splitting the file into chunks if necessary.

    $ perl blobs.pl l perl

Lists the details of the "perl" package stored in the database.

    $ perl blobs.pl d perl perl_stable.5.6.1.tgz

Downloads the perl binary and saves it to a new file.

=head1 The script

=head2 parameters

u|d|l|r     = (u)pload | (d)ownload | (l)ist | (r)emove

name        = the name of the package that we want to upload /
              download / list. In the latter case, you can use DB
              wildcards ('%' = any sequence of chars, '_' = any character)

filename    = the name of the file to upload / download.
              Mandatory only for uploading. If missing when we
              download, the name stored in the database is used.

version     = free text up to 12 characters

description = free text up to 250 characters

=head2 Status of this script

This script is mainly provided for tutorial purposes. Although it
works fine, it is not as robust as I would like it to be. I am
planning to make a module out of it, to isolate the data management
from the interface. Eventually I will do it. In the meantime, please
forgive my hasty interface and try to concentrate on the theory
behind it. Thanks.

=head2 handling parameters

Nothing fancy. Interface to a minimum. Parameters are read
sequentially from the command line. Optional parameters are
evaluated according to the current operation.

=cut

my $op = shift or help();          # operation (upload / download / list / remove)
help() unless $op =~ /^[udlr]$/;
my $softname = shift or help();    # package name
my ($filename, $version, $description) = (undef, undef, undef);
if ($op eq "u") {                  # read optional parameters
    $filename    = shift or help();
    $version     = shift;
    $description = shift;
}
elsif ($op eq "d") {
    $filename = shift;
}

=head2 connection

If this were a module, you would have to pass an already constructed
$dbh object. Since it is a script, instead, you should modify the
statement to suit your needs. Don't forget to create a "software"
database in your MySQL system, or change the name to a more apt name.

=cut

my $dbh = DBI->connect("DBI:mysql:software;host=localhost;"
        . "mysql_read_default_file=$ENV{HOME}/.my.cnf",
        undef, undef, {RaiseError => 1});

=head2 Table structure

The table is created the first time the script is executed,
unless it exists already.

=cut

#$dbh->do(qq{CREATE DATABASE IF NOT EXISTS software});
$dbh->do(qq{CREATE TABLE IF NOT EXISTS software_repos
    (id INT not null auto_increment primary key,
     name varchar(50) not null,
     description varchar(250),
     vers varchar(15),
     bin mediumblob,
     filename varchar(50) not null,
     username varchar(30) not null,
     updated timestamp(14) not null,
     key name(name),
     unique key idname (id, name)
    )});

=head2 script flow

Depending on the value of $op (operation), the appropriate
subroutine is called.

=cut

if ($op eq "l") {
    list($softname);
}
elsif ($op eq "u") {
    upload($softname, $filename, $version, $description);
}
elsif ($op eq "r") {
    remove($softname);
}
else {
    download($softname, $filename);
}

$dbh->disconnect();

=head2 functions

=over 4

=item getlist()

getlist() gets the details of a given package stored in the database
and returns a reference to an array of array references with the
selected table information.
=cut

sub getlist {
    my $sname = shift;
    my $row = $dbh->selectall_arrayref(qq{
        select name, vers, count(*) as chunks,
               sum(length(bin)) as size, filename, description
        from software_repos
        where name like "$sname"
        group by name
        });
    # the GROUP BY clause is necessary to give the total
    # number of chunks and the total size
    return $row;
}

=item list

list() calls getlist() internally and prints the result.

=cut

sub list {
    my $sname = shift;
    my $row = getlist($sname);
    return undef unless $row->[0];
    print join "\t", qw(name ver chunks size filename description), "\n";
    print '-' x 60, "\n";
    print join "\t", @{$_}, "\n" for @$row;
}

=item remove

remove() deletes an existing package from the database table.
Nothing happens if the package does not exist.

=cut

sub remove {
    my $sname = shift;
    $dbh->do(qq{ delete from software_repos where name = "$sname" });
}

=item upload

upload() reads a given file, in chunks not larger than the value of
max_allowed_packet, and stores them into the database table.

=cut

sub upload {
    my ($sname, $fname, $vers, $descr) = @_;
    open FILE, "< $fname" or die "can't open $fname\n";
    my $maxlen = getmaxlen();   # gets the value of max_allowed_packet
    my $bytes = $maxlen;
    $fname =~ s{.*/}{};         # removes the path from the file name
    print "$fname\n";
    my $sth = $dbh->prepare(qq{
        INSERT INTO software_repos
            (name, vers, bin, description, filename, username, updated)
        VALUES ( ?, ?, ?, ?, ?, user(), NULL)});
    # before uploading, we delete the package with the same name
    remove($sname);
    # now we read the file and upload it piece by piece
    while ($bytes) {
        read FILE, $bytes, $maxlen;
        $sth->execute($sname, $vers, $bytes, $descr, $fname) if $bytes;
    }
    close FILE;
}

=item download

download() is upload()'s counterpart. It fetches the chunks from the
database and composes a new binary file.

=cut

sub download {
    my ($sname, $fname) = @_;
    # if we don't supply a name, the one stored in
    # the database will be used
    unless (defined $fname) {
        my $row = getlist($sname);
        die "$sname not found\n" unless $row->[0];
        $fname = $row->[0][4];
    }
    # checks if the file exists. Refuses to overwrite
    if (-e $fname) {
        die "file ($fname) exists already\n";
    }
    open FILE, "> $fname" or die "can't open $fname\n";
    my $sth = $dbh->prepare(qq{
        SELECT bin
        from software_repos
        where name = ?
        order by id
        });
    $sth->execute($sname);
    my $success = 0;
    while (my @row = $sth->fetchrow_array()) {
        syswrite FILE, $row[0];
        $success = 1;
    }
    close FILE;
    die "$sname not found\n" unless $success;
}

=item getmaxlen

getmaxlen() returns the value of max_allowed_packet.

=cut

sub getmaxlen {
    my $rows = $dbh->selectall_arrayref(
        qq{show variables LIKE "max_allowed_packet"});
    for (@$rows) {
        # returns the max_allowed_packet
        # minus a safely calculated size
        return $_->[1] - 100_000;
    }
    die "max packet length not found\n";
}

=item help

help() gives a summary of the script usage.

=back

=cut

sub help {
    print <<HELP;
usage: blobs {l|u|d|r} name [[filename] [version] [description]]
    Where
    l|u|d|r is the operation (list|upload|download|remove)
    name is the name of the software to be uploaded|downloaded
    filename is the file to send to the database (upload)
        or where to save the blob (download). Optionally,
        you can supply a version and a description
HELP
    exit;
}
Sample usage:
$ perl blobs.pl u mysql4 mysql-4.0.1-alpha.tar.gz "4.0.1" "MySQL alpha"
This command will store the MySQL binary distribution (10 MB), divided into several small pieces, to make them pass through the bottleneck (currently 2 MB in my database server).
To restore the file, I will issue the command:
$ perl blobs.pl d mysql4 mysql-4.0.1-alpha.tar.gz

I am presently using this software repository (with a totally different interface, but that is another story) to distribute software updates among a few hundred users. Management is centralized: the developers upload their new releases to the system, and the users download them and put them to good use.
I can think of several other uses for this architecture, such as a multimedia repository or an archive of office documents, and I leave the rest to your imagination.

Enjoy!
 _  _ _  _  
(_|| | |(_|><
 _|   

Re: Handling huge BLOB fields with DBI and MySQL
by gmax (Abbot) on Mar 08, 2002 at 13:52 UTC

    Storing a large piece of data in a database is not always the best solution, but sometimes it is the only thing you can do, so it's good to know that you can.

    On the other hand, if I store something in a database, it's because I need to manipulate it. If I just store it in a database and then retrieve it without any additional benefit such as searching, counting, averaging, or summing up, then I am just wasting my time and I should rather resort to FTP.

    For my part, I don't like keeping filenames in the database and the data in files, because it adds one level of complexity to my system. Not to the uploading part, which could actually be easier, but to the maintenance business.

    If I have everything in my database, I should take care of one thing only. All the system is in one place, for good and for bad.

    I actually tried the filenames approach and I found that it is not as practical as I would like.

    I noticed that by not letting the database deal with the storage, some operations were even more complicated. For example, I need to guarantee the data integrity of my fields, and I do so by running a CRC function and comparing the results to other values stored in a table. I can do this with one single query. Not very practical to do with the files scattered around the filesystem. Or I need to calculate the total size of my fields, depending on the values of some other fields. These are all things you can do if you have access to the server's filesystem, while the database can accommodate your needs with a simple call. Besides, you may want to store BLOBs that you need to search: they could be documents whose headers you need to index. I can find so many advantages in having the data in a database that I am very reluctant to give up the possibility of doing so.
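
    For instance, the integrity check I mentioned can stay a single query. A sketch of the idea (the checksum column is an assumption, it is not part of the table shown in the article):

        # a sketch: list the chunks whose stored checksum no longer matches
        # the data ("checksum" is an assumed column, filled at upload time)
        my $bad = $dbh->selectall_arrayref(qq{
            SELECT id, name
            FROM software_repos
            WHERE checksum <> MD5(bin)
        });
        print "corrupted chunk: id $_->[0] ($_->[1])\n" for @$bad;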

    The difficult thing is only sending the data and fetching it. Apart from that, the rest of the operations are easier in a database than in a collection of files.

    I am not saying that the "filenames" solution is wrong. I just want to have it simple. For my needs, BLOBS are the right choice.

    I am basically agnostic, so I won't try to convince anybody that what I am doing is the thing to do. Rather, I state my reasons and I will welcome anybody who shares my thoughts. I know that many people want to pursue the BLOB choice but have had technical problems, so I am offering my solution.

     _  _ _  _  
    (_|| | |(_|><
     _|   
    
      While this is not a perl question, how does storing these files in blobs impact your database recovery procedures?

      From what I understand, MySQL does have a (slightly crude) recovery method -- the database keeps an update log of all activity. You can then replay the update log which stores the changes since your last database backup to bring the system "up to date". (Although I have not played with this, I also assume you could edit the update log to simulate a point in time recovery.)

      It would seem that storing just the file names (assuming you do not have version control in place for the documents) means that you would have a very difficult time recovering from certain failures. However, (as I mentioned earlier, I have not played with the MySQL update logs) it would also seem that storing the changes to multi-meg or multi-gig fields would cause the update log to exceed the OS file size limitations.

      How have these concerns impacted your implementations?

        Recovery of a database could be as easy as restoring your latest backup and restarting business, if you are well organized.

        If you are using binary logs, the system can recover fairly easily. BLOBs are not a problem here, they are just more data in your database.

        About organizing yourself, you might have noticed that I added a timestamp field to my table. This way, I can have a progressive backup of the fields that were modified in a given timeframe, to integrate with a full weekly backup.
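
        A sketch of what I mean (the cut-off date is whatever your last full backup was; where the rows go from there is up to your backup scheme):

            # a sketch: fetch only the chunks modified since the last full
            # backup, relying on the "updated" timestamp column
            my $sth = $dbh->prepare(qq{
                SELECT id, name, filename, bin
                FROM software_repos
                WHERE updated >= ?
                ORDER BY id
            });
            $sth->execute('2002-03-01 00:00:00');  # date of the last full backup
            while (my @row = $sth->fetchrow_array) {
                # hand each row to whatever stores the incremental backup
            }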
        The subject deserves more space than we can dedicate here. The matter is explained much better than this in Paul DuBois' book, MySQL.

        Personally, I would say that storing blobs in sparse files makes your task more difficult, but TMTOWTDI, after all, and I might be wrong. Let's say that I am just more comfortable with my current architecture.

         _  _ _  _  
        (_|| | |(_|><
         _|   
        
      If you are writing binary blobs to a file, don't you need to set binmode on your output stream? i.e. binmode OUTFILE;

        binmode is mandatory only on MS-DOS/Windows systems.

        perldoc -f binmode
        On some systems (in general, DOS and Windows-based systems) binmode() is necessary when you're not working with a text file.
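
        In other words, if portability matters, one extra line right after the open in download() (and likewise for the input file in upload()) is all it takes:

            open FILE, "> $fname" or die "can't open $fname\n";
            binmode FILE;  # a no-op on Unix, required on DOS/Windows for binary data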
Re: Handling huge BLOB fields with DBI and MySQL
by lachoy (Parson) on Mar 08, 2002 at 13:00 UTC

    This seems like an interesting solution and you explain it very clearly here. But honestly, I've very rarely found it necessary to store entire files in the database. Particularly because BLOB handling is one of those database features that varies widely from DBMS to DBMS and any solution I write I know I'll have to support in at least three systems :-)

    Instead I store pointers to the files -- filenames -- in the database and let the filesystem do what it does best. The most flexible type of filename stored in the database is a partial one, so the calling program needs to have a parameter/property for the root directory. Other tools (e.g., rsync) have been developed and optimized to take care of the distribution/replication issue for files.

    Chris
    M-x auto-bs-mode

      I have an irrelevant question in regards to BLOBs, actually. I recently had to do a migration of a database that contains a BLOB field (images); I had an exported .sql file which I imported into the new database. Here comes the problem: the BLOB (image) field was unable to load on the new server. Anyone know why that is? I have the exact same code/files as the old one on the new server... Other fields load fine from the database except the BLOB field. Thanks a lot.
Re: Handling huge BLOB fields with DBI and MySQL
by mpeppler (Vicar) on Mar 08, 2002 at 19:48 UTC
    Interesting way to handle this sort of thing within the limits of your database server.

    Personally I'm with lachoy here - I avoid storing BLOBs in the database because the interface to fetch/store them is often convoluted.

    On the other hand storing everything in the database potentially simplifies the administration of your data, so you probably need to balance one against the other.

    As usual I'll now give you the Sybase POV :-)

    Sybase stores BLOBs in IMAGE or TEXT columns. These are variable-size datatypes that can store up to 4 GB, assuming the database is large enough, and they are not limited by the database device size or by the filesystem limitations of the underlying OS.

    IMAGE/TEXT columns can be written to and fetched from using normal SQL statements. However, as with MySQL, the server sets default limits on the amount of data that can be transferred in a single operation (see the TEXTSIZE option).

    To circumvent this limit (or to insert really large BLOBs) you can use special API calls to store/fetch the data in chunks. This API is available in DBD::Sybase, Sybase::CTlib and Sybase::DBlib, but it's a little convoluted (especially for writes, where you first have to fetch a "text pointer" and then call the write function with this pointer).

    For example, with DBD::Sybase you'd do something like this (taken from the DBD::Sybase man page):

    # update a database entry with a new version of a file:
    my $size = -s $file;

    # first we need to find the CS_IODESC data for the data
    $sth = $dbh->prepare("select img from imgtable where id = 1");
    $sth->execute;
    while($sth->fetch) {    # don't care about the data!
        $sth->func('CS_GET', 1, 'ct_data_info');
    }

    # OK - we have the CS_IODESC values, so do the update:
    $sth->func('ct_prepare_send');

    # Set the size of the new data item (that we are inserting), and make
    # the operation unlogged
    $sth->func('CS_SET', 1, {total_txtlen => $size, log_on_update => 0},
               'ct_data_info');

    # open the file, and store it in the db in 1024 byte chunks.
    open(IN, $file) || die "Can't open $file: $!";
    while($size) {
        $to_read = $size > 1024 ? 1024 : $size;
        $bytesread = read(IN, $buff, $to_read);
        $size -= $bytesread;
        $sth->func($buff, $bytesread, 'ct_send_data');
    }
    close(IN);

    # commit the operation
    $sth->func('ct_finish_send');

    Like I said - the API is pretty convoluted...

    Michael

Re: Handling huge BLOB fields with DBI and MySQL
by Coruscate (Sexton) on Jan 23, 2003 at 10:29 UTC

    I know I'm a little late in commenting on this post, but better late than never :)

    I must be tired, or I momentarily blanked out, because I was looking at the upload() function and saying, "But that doesn't make sense: he's trying to do an INSERT on the same ID for every chunk of length $maxlen, which should definitely fail because ID is a primary key". Then I remembered that it was an auto-incrementing column, so I looked at the INSERT statement and (yay!) noticed that the id was not explicitly being passed. So I realized that one file has multiple rows within the table, identifiable by the name of the package ($sname). I then went on to find out that the data was retrieved via the fact that the IDs would be in the correct order when sorted. Bingo, I found the main key, and everything made perfect sense.

    Then I thought of one more thing and looked for something in the code to satisfy my personal needs, to no avail. Anywhere within that code, is there something that stops a person from uploading the same softname and filename more than once? In this case, there would really be two files in the DB with the same key identifier (the name column). Then, if a person were to attempt to download by that name, they'd get two files concatenated into one, with no warning. So should I go to bed because I completely missed a line in the code somewhere that dies if the package exists in the DB, or am I right? And since this code is almost a year old right now, I suppose it has been updated since then? :)

      Actually, there is nothing in the database design preventing a double entry in the repository. Only the application is enforcing data integrity.

      You'll notice that the script deletes the currently stored version of the package, precisely in order to avoid such duplicates.

      The real repository has a further control, a UNIQUE key on "name" + "vers" (which is defined NOT NULL) + the file CRC calculated before the update, so that duplicates are harder to pass. This rule relies on the assumption that all application versions are incremented automatically at each build.

      Add to this that the ones allowed to update this table are all seasoned developers, who are using the same application to update. They are all disciplined and each one is in charge of a different application, so I didn't have a single case of double entries in this paradigm.

      Thanks for your comments.

       _  _ _  _  
      (_|| | |(_|><
       _|