PerlMonks  

Storing/Retrieving images as blobs

by Anonymous Monk
on Sep 26, 2011 at 06:43 UTC ( id://927800 )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I use CGI and Perl DBI to store images as BLOBs in MySQL. I fetch an image by its primary key:

    select mimetype, image from table where rowid=nn

Once I have the row from the DBI calls, I display the image using CGI:

    print header(-type => $mimetype, -Content_Length => length($image)), $image;

The primary reason I do this (as opposed to storing the images in files) is that I do not need to keep track of the files and do not have to worry about backing up the images separately. The images are stored in a separate image table, which I do not query except for lookups by primary key to display an image. The maximum size of an image would be 200K.

My image table is now about 3GB. The response time for image display is acceptable now, but the image table is likely to grow a good bit, and the hits will increase gradually.

I am wondering whether I will run into performance problems with this approach and whether I should explore other alternatives. I would be thankful for your input.

Replies are listed 'Best First'.
Re: Storing/Retrieving images as blobs
by dHarry (Abbot) on Sep 26, 2011 at 06:59 UTC

    I once worked on a project moving TIFF images (scanned passports and other ID documents) stored as BLOBs in an Oracle database out to the file system. We ran into performance problems, and moving the images out solved them. I don't know MySQL's limits, but I think sooner or later you will run into performance problems as well. In the end the file system gives better performance (it scales better). You have many small images, so you need some clever tree structure to locate an image quickly; many little files in one directory is normally an OS-killer. There are also dedicated image databases around which do a good job.

    Cheers

    Harry

      Why a clever tree structure when you could have a table that maps keys to file names? I'm not expert enough to know whether storing the binary in the table or as a file in the file system is the winner, though. Having a file would mean the image is still available should the database be down, and with some logical naming scheme it should be locatable and accessible by real people.

        The OS-killer issue alluded to above arises when you have a large number of entries in a single directory. In order for the OS to resolve a file name, it needs to look up each component of the path in its directory. If a directory holds a significant number of entries, multiple disk reads may be necessary to find a particular entry.

        As an example, assume that 20 directory entries fit in a block that can be read in a single shot. With 100 files you will need, on average, about three disk reads to find the location of a particular file (scanning N/2 entries gets you into the third of the five blocks needed to store 100 entries). This increases to 25 reads for 1000 entries (1000 / 2 => 500, 500 / 20 => 25).

        If you break the directory structure into parts so that no directory contains more than 20 entries, you can find the file location, on average, in two reads (100 files split into 5 subdirectories of 20 entries each). With 2 levels of no more than 20 entries each, you can store and retrieve 400 files with a guaranteed maximum of two reads; 8000 in three, 160K in four, 3.2M in five, and so on.
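        The arithmetic above can be checked with a short script. This is a rough model under the stated assumptions (20 entries per block, a linear scan that examines half the entries on average); the function name avg_reads is just for illustration:

            use strict;
            use warnings;
            use POSIX qw(ceil);

            # Average reads to locate one entry in a flat directory:
            # a linear scan examines entries/2 entries on average, and
            # each block read covers $per_block of them.
            sub avg_reads {
                my ($entries, $per_block) = @_;
                return ceil(($entries / 2) / $per_block);
            }

            printf "%d entries: %d reads on average\n", $_, avg_reads($_, 20)
                for 100, 1000;

        This reproduces the figures above: about 3 reads for 100 entries and 25 for 1000.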

        The value of 20 depends on the file system and on how directory lookups are done in the OS or the application. A full readdir scan at the application level is less amenable to OS optimization than having the OS open the file for you. Basically, you want a depth of X and a fan-out of Y entries per directory, where Y^X is the number of files (N) you can store before you are willing to reorganize your files.
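        One common way to get such a structure is to hash the image's primary key and use leading digest characters as directory levels, giving a uniform fan-out without ever renaming anything. A minimal sketch; the function name path_for and the base path /var/images are illustrative, not from the original post:

            use strict;
            use warnings;
            use Digest::MD5 qw(md5_hex);

            # Two directory levels of two hex characters each gives
            # 256 * 256 evenly filled buckets.
            sub path_for {
                my ($rowid) = @_;
                my $digest = md5_hex($rowid);
                my ($a, $b) = (substr($digest, 0, 2), substr($digest, 2, 2));
                return "/var/images/$a/$b/$rowid.img";
            }

            print path_for(12345), "\n";

        Because the path is derived purely from the key, the application never needs to record where a file landed.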

        If the directory entries are actually stored in a database (I think that BeOS had that feature), then you may not need to worry about the read sizes and how many directory entries can fit into a single read.

        As an interesting aside, the classical Unix File System used this technique to minimize the number of lookups needed to find the data blocks of a given file. The inode pointed to by the directory entry held direct pointers to data blocks for small files, then pointers to blocks of pointers to data blocks for larger files, and then pointers to blocks of pointers to blocks of pointers for even larger files. The more data blocks a file had, the deeper this "directory of pointers" became.

        --MidLifeXis

        There are several pros and cons to both alternatives, and a detailed tradeoff analysis is needed to draw any conclusions. In my experience the database sooner or later grinds to a halt when the tables with BLOBs become too big; after that there is no choice, really. The OP's 3GB doesn't seem too big, but I'm thinking Oracle here. I have no idea when MySQL gives up; I suppose it depends on the type/version of the MySQL database, the configuration, the hardware, etc.

        Assuming you reach the point where you have to move the images to the file system, it pays off to design a directory structure that prevents the OS from choking and/or improves access times. You typically don't want millions of small files in one directory.

        As I mentioned, an alternative is a dedicated database product for storing the images; such databases are optimized for that purpose.

Re: Storing/Retrieving images as blobs
by chrestomanci (Priest) on Sep 26, 2011 at 09:43 UTC

    I don't think you will see performance problems as your table fills up with images, because the number of images stored will be relatively low, and your database will not be trying to index the binary data in the BLOBs.

    If you had a normal 3GB table where the cells contained text, numbers, or foreign keys, there could easily be 100 million rows, and the database would be maintaining indexes and foreign-key relationships on all of them, which is a lot of work. In your case the 3GB table probably contains only about 150 thousand rows, which is fairly small, so not much work to index.

    I think the total size of the data will only become a problem when it becomes an issue for the underlying file system that your database uses to store its data. If it fills the disk to the point that the disk becomes heavily fragmented, or suchlike, there will be a problem, but you would have had those problems if you were storing the files on disk anyway.

    Having said that, I don't think you should store the image files in your database, because doing so makes the database much larger, which makes it harder to back up, restore, or run in a cluster. The problem is that it is generally hard to back up a database incrementally without deep knowledge of its schema, so most backups will be full backups. On the other hand, it is easy to back up a directory of files incrementally. So I think your best long-term strategy is to keep the binary data out of the database (keeping the database small), store your images separately on the file system, and develop a backup procedure that covers both the database and the image file system. Seeing as everyone needs to back up file systems, it should be very easy to find a suitable tool for that part.

    Contrary to what dHarry said, I don't think a huge number of files in one directory will hurt file-system performance: modern Linux file systems use hashed or tree-based directory indexes and can easily cope with thousands of files per directory. However, you should probably split things up anyway for your own sanity, because while the file system will cope fine with 100_000 files in a directory, ls, or worse a GUI file browser, will not cope so well.

      Maybe I should have clarified what I meant by "huge"; obviously a few thousand is not huge. I am thinking more of millions of small files. OSes normally don't like directories with that many files. (The OP doesn't have millions of files, though.) It's nice that some OSes have better support for this nowadays, but the OP didn't specify his OS/version; maybe he is using an old Unix/Windows OS? Chopping things up into subdirectories will also improve search/access times.

Re: Storing/Retrieving images as blobs
by sundialsvc4 (Abbot) on Sep 26, 2011 at 12:54 UTC

    I think that the general consensus is that the problem is that there is “one thing,” and it is a big one-thing. The difficulty is a management issue, not a technical performance issue.

    The traditional solution seems to be to store the data as separate files in some kind of subdirectory structure, using a database table as the reference. However, this can cause other issues, because it presupposes a shared network filesystem between the various servers. Sometimes you want to keep everything under the auspices of the database server.

    One strategy that I have seen used is to maintain multiple tables of images. Each image is uniquely identified (e.g. by a UUID), and a master directory-table gives the (database name and) table name where that image can be found. The application queries this table to find the image; it is an error for the key not to be found there. (Notice that the master directory-table can be rebuilt at any time if necessary, because of the globally unique identifiers.) This hybrid strategy is intended strictly to keep the image-data tables at a more convenient size, while preserving “the database server” as the means of getting to the data. All of the “smarts” of doing this should, of course, be encapsulated in an opaque Perl object that knows how to Do The Right Thing.™
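    A sketch of what such a master directory-table might look like. All table and column names here are hypothetical; the post describes the idea, not this exact schema:

        -- Master directory: one row per image, locating its storage table.
        CREATE TABLE image_directory (
            uuid       CHAR(36)    NOT NULL PRIMARY KEY,  -- globally unique id
            db_name    VARCHAR(64) NOT NULL,              -- which database
            table_name VARCHAR(64) NOT NULL               -- which image table
        );

        -- Each image-data table shares one layout, so new ones can be
        -- added as old ones reach a convenient maximum size.
        CREATE TABLE images_01 (
            uuid     CHAR(36)    NOT NULL PRIMARY KEY,
            mimetype VARCHAR(64) NOT NULL,
            image    MEDIUMBLOB  NOT NULL
        );

    Retrieval is then two keyed lookups: one against image_directory to find the table, and one against that table for the BLOB itself. Because every row carries its UUID, the directory can be rebuilt by scanning the image tables.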

Re: Storing/Retrieving images as blobs
by Anonymous Monk on Sep 26, 2011 at 13:46 UTC
    I appreciate the time you have taken to highlight the potential problem areas. I had not thought about the directory structure.

    I will use the advice given here to come up with a long term solution.

    Thanks

Approved by Corion