Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re: Large chunks of text - database or filesystem?

by TedPride (Priest)
on Mar 19, 2005 at 13:05 UTC ( [id://440916]=note: print w/replies, xml ) Need Help??


in reply to Large chunks of text - database or filesystem?

I don't have a lot of leeway in what I can do, though. This would be for a remote hosting account without access to cron (though I can fake that) or huge amounts of disk space (probably around 300-500 MB). Assuming the worst-case scenario of 300 MB and around 65K per post (64K for blob + necessary fields), we're talking a max number of around 4600 posts, which isn't in my opinion acceptable.

Does a blob require the max amount of space regardless of the size of its contents?
If yes, is there a way to specify a fixed length field of more like 10K in size?

  • Comment on Re: Large chunks of text - database or filesystem?

Replies are listed 'Best First'.
Re^2: Large chunks of text - database or filesystem?
by BrowserUk (Patriarch) on Mar 19, 2005 at 13:53 UTC

    Assuming MySQL, BLOB types are variable length. The length you store + 1, 2, 3 or 4 bytes to store the length.

    Column type Storage required CHAR(M) M bytes, 1 <= M <= 255 VARCHAR(M) L+1 bytes, where L <= M and 1 <= M <= 255 TINYBLOB, TINYTEXT L+1 bytes, where L < 2^8 BLOB, TEXT L+2 bytes, where L < 2^16 MEDIUMBLOB, MEDIUMTEXT L+3 bytes, where L < 2^24 LONGBLOB, LONGTEXT L+4 bytes, where L < 2^32 ENUM('value1','value2',...)1 or 2 bytes, depending on the number of en +umeration values (65535 values maximum) SET('value1','value2',...) 1, 2, 3, 4 or 8 bytes, depending on the num +ber of set members (64 members maximum)

    Which should ease your concerns a little. You could improve things a bit by zipping the data on the way in, and unzipping it on retrieval, but that comes at the cost of being able to use the (somewhat slow and primitive) SQL search facilities.

    Then again, you could index (using perl rather than the DB), the words in each post when it is created, which would allow you to locate the posts via keywords very quickly. You could store the index in the DB either as Storable hash or as a table, though the latter might be costly in terms of space and time.

    And, once you have built an index of the words in a post, why store the words? If you fully invert your posts, you could probably get away with storing a 24-bit number to represent the "words" in the posts", which would probably represent an overall compression as you would be discarding the spaces.

    The same file/table/frozen hash that you use to relate index entries to words, could also store post numbers where the words appear. You now have the basis of fairly well compressed, highly indexed datastore the allows very fast search and retreival.

    Just musing :)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
Re^2: Large chunks of text - database or filesystem?
by bpphillips (Friar) on Mar 19, 2005 at 13:41 UTC
    BLOBs are variable length (at least in MySQL, blob storage requirements). There are also MEDIUMBLOBs which hold up to 16MB and LONGBLOBS which hold up to 4GB, all as variable length fields requiring at most 4 bytes more than the data to store the length of the data.

    If you happen to be fortunate to have a hosting account with MySQL v4.1.1 or greater you might be able to use the (de)compress() function to (de)compress the data on the fly in MySQL. Or for that matter, you could just compress it using perl before inserting it in the table.

    For searching, plucene is a really cool product but I've only used it in relatively small settings (i.e. indexes of under 30 MB). I've heard differing opinions on plucene's speed so it my not be an acceptable option for your case. MySQL also offers full-text searching functionality since v3.23.23 but that would prevent you from doing compression of the data.

    As another alternative, I recently heard of another effort (still in alpha) that was inspired by Plucene but with the goal of overcoming Plucene's performance problems. It's called Kinosearch. I have no actual experience with it but it appears to be promising (I'd be interested in hearing from anyone else that's actually used it).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://440916]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (4)
As of 2024-04-25 17:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found