PerlMonks  

Anyone using SPOPS to store BLOBs?

by talexb (Chancellor)
on Jun 24, 2003 at 15:05 UTC ( [id://268546] )

talexb has asked for the wisdom of the Perl Monks concerning the following question:

I am looking into using SPOPS to store big lumps of data (PDFs, actually) and was wondering if anyone else has experience doing this.

I am using Postgres as my database, so the PDFs would (presumably) go into a TEXT field.

--t. alex
Life is short: get busy!

Replies are listed 'Best First'.
Re: Anyone using SPOPS to store BLOBs?
by bunnyman (Hermit) on Jun 24, 2003 at 16:02 UTC

    Reconsider storing PDF files as BLOBs. Just store them as regular files and store the path to the file in the DB.

    With lots of large BLOBs, the database gets bloated, the DB dumps become huge, and it can create problems on platforms with a 2GB file limit, like some versions of Linux.

    When you stream a BLOB to a client, you must hold a DB handle open the entire time, and if you read the whole BLOB into a Perl variable, you end up with a large Perl process. Your operating system's filesystem is designed to store files, and you will have a much easier time using it.

    At first glance, SPOPS doesn't seem suited to this task at all. It is a framework for saving Perl objects (complex data structures) to a database. Files are not good things to store in databases.
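    A minimal sketch of the files-on-disk approach, assuming a hypothetical documents(filename, path) table and a caller-supplied database handle and store directory (all names here are illustrative, not from the original post):

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Spec;

# Copy the file into a store directory and record its path in the DB.
# 'documents(filename, path)' is an assumed table for this sketch.
sub store_pdf {
    my ( $dbh, $src, $store_dir ) = @_;
    my ($name) = $src =~ m{([^/\\]+)\z};    # basename of the source file
    my $dest = File::Spec->catfile( $store_dir, time() . "-$name" );
    copy( $src, $dest ) or die "copy $src -> $dest failed: $!";
    $dbh->do( 'INSERT INTO documents (filename, path) VALUES (?, ?)',
              undef, $name, $dest );
    return $dest;
}
```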

      The problem with storing filenames in a database, and storing the files on disk, is that then you need to roll your own remote file access in the case where multiple client machines need access to the same file backstore. While not impossible, it's certainly higher in development cost than using a canned solution if your preferred database provides it. Especially if you need to support remote file locking, atomic cross-system unique filename generation, and so on.

      The moral being: file access is easy when you have one program running on one machine. It becomes more difficult when you have multiple programs sharing the same data on one machine, and even more difficult when multiple programs on multiple machines need to coordinate access to shared resources on one or more remote servers. Databases are good at this, but as others have said, it can wreck the performance of your other relations.

      Alan

        These are indeed good reasons for BLOBS over files: concurrency and remoting.

        Another is transactions over multiple operations, on multiple files/tables, on multiple servers. This is very difficult to do with filesystems. You need to do 2-phase commit, backup directories, etc. With BLOBS, operations are part of the normal transaction.

        Another is consistency: the BLOB fields are handled the same way as the rest of the data. No need to think tables & files. Just tables.

        Besides, BLOBS are not as difficult as they used to be. If you use something like Tangram, and your BLOBS fit in memory, you don't even have to think about it. You use them like regular fields.

      The reason I am looking at using a text field is that Postgres doesn't appear to support BLOBs. However, it looks like the bytea type may be suitable instead.

      --t. alex
      Life is short: get busy!

        what i believe bunnyman is suggesting is that you do not use a database to store the files. instead, store the filenames, and write the files to disk (just don't erase them.)

        ~Particle *accelerates*

      It's a long story, but the short answer is that I really *do* need to store files into a database. It makes them more portable and accessible.

      It's unlikely that we'll approach a file that's 2G -- I have tried various files up to a 12M example that we have, and they all worked the way I expected, except for the side effect of adding single quotes around the entire file's contents when I extract the file back out of the database.

      The only time we'll need to access the file is when we copy it back out to the file system for processing -- taking just the file name isn't going to work for us because that file may no longer be available when we want to do our processing.

      --t. alex
      Life is short: get busy!
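      For what it's worth, the stray single quotes suggest the data is being interpolated into the SQL rather than bound as a parameter. With DBD::Pg, a placeholder bound as PG_BYTEA lets the driver do the escaping. A hedged sketch -- the table and column names are made up:

```perl
use strict;
use warnings;
use DBD::Pg qw(:pg_types);

# Assumed schema for this sketch: documents(doc_id, pdf bytea).
sub save_pdf {
    my ( $dbh, $doc_id, $pdf_data ) = @_;
    my $sth = $dbh->prepare('UPDATE documents SET pdf = ? WHERE doc_id = ?');
    # PG_BYTEA tells the driver to escape the binary data itself,
    # so no quoting artifacts end up inside the stored value.
    $sth->bind_param( 1, $pdf_data, { pg_type => PG_BYTEA } );
    $sth->bind_param( 2, $doc_id );
    return $sth->execute;
}
```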
        It's unlikely that we'll approach a file that's 2G

        What I meant was the entire database could become 2GB in total size after adding enough files to it. If the DB is stored in one file, that would be a problem unless your system supports big files or the database is split over a few files.

        I understand the initial thinking behind storing BLOBs in the database, but I don't think a 2G database is more portable than a 1M database with a tree structure of subdirectories that hold the PDFs. The tree structure can be tarred/gzipped at any time to copy in bulk.

        I have worked with a VLDB that had BLOBs in it, and the performance was poor. The design team decided to go with file names and pseudo-paths to find the images. This repository, one of the largest collections of GIF and JPG images in the world and a competitor to Getty, now works well.

        Maybe I am missing something here, but why is remote access to this database harder than normal if the PDFs are stored separately? The file locking for user access should be done at the database level, sort of like a check-in and check-out system.

        I guess I am really curious as to why the PDFs have to be embedded, so bore me with the long story :}

        Richard

        There are three types of people in this world, those that can count and those that cannot. Anon

      Apparently Postgres has a 16TB limit for tables, and a BLOB limit of 2G per chunk. If Postgres is used to store lots of BLOBs, I am guessing that it actually stores them in (drumroll please) the file system!!

      --t. alex
      Life is short: get busy!
Re: Anyone using SPOPS to store BLOBs?
by lachoy (Parson) on Jun 25, 2003 at 02:48 UTC

    You can access TEXT fields normally, but it may be the case that once you stuff n characters in the field it becomes a blob stored outside the table. (Do a '\dl' in psql to get a listing of all LOBs stored in the db.)

    Looking at the DBD::Pg docs you need special operations to work with LOBs. (Fortunately for framework authors, these operations vary from database to database...) You can either read it in all at once or a chunk at a time, both using the blob_read function of the statement handle.
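    A chunk-at-a-time read via blob_read might look like the sketch below. Hedged: whether the driver honors the offset argument or simply reads sequentially varies by DBD, so check the DBD::Pg docs before relying on this.

```perl
use strict;
use warnings;

# Stream a LOB column from an executed statement handle into a file,
# one chunk at a time, so the whole BLOB never sits in memory at once.
sub lob_to_file {
    my ( $sth, $field, $path ) = @_;
    open my $out, '>', $path or die "open $path: $!";
    binmode $out;
    my ( $offset, $chunk_size ) = ( 0, 64 * 1024 );
    while ( defined( my $chunk = $sth->blob_read( $field, $offset, $chunk_size ) ) ) {
        last unless length $chunk;    # empty chunk means end of data
        print {$out} $chunk;
        $offset += length $chunk;
    }
    close $out or die "close $path: $!";
    return $path;
}
```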

    What I'd do is handle it outside of SPOPS entirely. If this is the only field in the table, then I probably wouldn't use SPOPS for the table at all. If you do use it, SPOPS can be told to deal with only certain fields of a table by listing them in the 'field' configuration key. Just don't list your PDF field there, and create a method (likely inherited or delegated, since it should be a fairly generic operation) that uses the DBD::Pg functionality to stream the LOB into a file. Then you have no memory worries and can use SPOPS normally.
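    The 'field' trick might look something like this in a SPOPS configuration. A sketch only -- the class and column names are made up; the point is just that the LOB column is absent from 'field':

```perl
my %config = (
    document => {
        class      => 'My::Document',
        isa        => [ 'SPOPS::DBI::Pg', 'SPOPS::DBI' ],
        base_table => 'documents',
        id_field   => 'doc_id',
        # The bytea/LOB column 'pdf' is deliberately NOT listed here,
        # so SPOPS never tries to fetch or save it.
        field      => [ qw( doc_id filename mime_type created_on ) ],
    },
);
```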

    Hope this makes sense.

    Chris
    M-x auto-bs-mode

      After some experimentation and some RTFM-ing I realized that of course I wanted 'bytea' fields instead of 'text' fields.

      Currently we use a Documents table with a bunch of information in it as part of our system; including the file itself would smooth a lot of things out -- we'd have to copy the file from Postgres to the file system for processing but (this is the cool part that I love) any files that were created during processing can then be stored back into the database. This means that we have the ability to back-track to previous iterations. (For obvious reasons, I can't describe my system in a lot of detail.)

      Your advice (third paragraph) does make sense .. thanks for the pointers. I'll be reading up on SPOPS more this week to see if we can fit your wheel in somewhere rather than build our own.

      --t. alex
      Life is short: get busy!
Re: Anyone using SPOPS to store BLOBs?
by Anonymous Monk on Jun 25, 2003 at 18:59 UTC
    I agree with the people who stated that storing the PDF file in the database is probably not the best idea. When I worked for an online newspaper, I was given an assignment to come up with a way to store large (in my case) PDF files. I opted out of storing them in the database due to the size of the files and the number of requests there would be for them. So this is what I did:

    1. Created a unique filename for the PDF file.
    2. Zipped up the PDF file and stored it in a secure location on the server under the new unique filename.
    3. Entered the name of the zipped file into the database, along with all the file characteristics for easy searching.

    When someone needed the file, the retrieval program would unzip it into a tmp directory, send it to the user, and then delete the temp file. This saved me space in the DB and on the server itself. All I had to do was make sure that the directory with the zipped files was backed up at the same time as the database, and all was good in life.
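    The three steps above might be sketched like this, using Archive::Zip. The table name, column names, and the uniqueness scheme are made up for illustration:

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);
use Digest::MD5 qw(md5_hex);

# Sketch of the three steps above; 'pdf_index' is an assumed table.
sub archive_pdf {
    my ( $dbh, $pdf_path, $store_dir ) = @_;

    # 1. A unique filename for the stored copy.
    my $unique = md5_hex( $pdf_path . time() . $$ ) . '.zip';

    # 2. Zip the PDF into the secure store under that name.
    my $zip = Archive::Zip->new;
    $zip->addFile( $pdf_path, 'document.pdf' );
    $zip->writeToFileNamed("$store_dir/$unique") == AZ_OK
        or die "could not write $store_dir/$unique";

    # 3. Record the zip name plus searchable metadata in the DB.
    $dbh->do( 'INSERT INTO pdf_index (zip_name, orig_name, bytes) VALUES (?, ?, ?)',
              undef, $unique, $pdf_path, -s $pdf_path );
    return $unique;
}
```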

Node Type: perlquestion [id://268546]
Front-paged by broquaint