Assuming MySQL, BLOB types are variable length: they require the length of the data you store plus 1, 2, 3 or 4 bytes to record that length.
Column type                  Storage required
CHAR(M)                      M bytes, 1 <= M <= 255
VARCHAR(M)                   L+1 bytes, where L <= M and 1 <= M <= 255
TINYBLOB, TINYTEXT           L+1 bytes, where L < 2^8
BLOB, TEXT                   L+2 bytes, where L < 2^16
MEDIUMBLOB, MEDIUMTEXT       L+3 bytes, where L < 2^24
LONGBLOB, LONGTEXT           L+4 bytes, where L < 2^32
ENUM('value1','value2',...)  1 or 2 bytes, depending on the number of enumeration values (65535 values maximum)
SET('value1','value2',...)   1, 2, 3, 4 or 8 bytes, depending on the number of set members (64 members maximum)
Which should ease your concerns a little. You could improve things further by zipping the data on the way in and unzipping it on retrieval, but that comes at the cost of losing the ability to use the (somewhat slow and primitive) SQL search facilities.
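A minimal sketch of the zip-on-the-way-in idea, using Compress::Zlib (the post body and the commented-out execute call are hypothetical; substitute your own statement handle and columns):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Compress::Zlib qw( compress uncompress );

# Hypothetical post body; in practice this comes from the user.
my $post = "The quick brown fox jumps over the lazy dog. " x 50;

# Deflate on the way in ...
my $zipped = compress( $post );

# ... store $zipped in the BLOB column, e.g.
# $sth->execute( $post_id, $zipped );

# ... and inflate on the way out.
my $restored = uncompress( $zipped );

printf "original: %d bytes, compressed: %d bytes\n",
    length( $post ), length( $zipped );
print "round trip ok\n" if $restored eq $post;
```

Repetitive prose like forum posts deflates well, but note that a LIKE clause against the compressed column will no longer match anything.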
Then again, you could index (using Perl rather than the DB) the words in each post when it is created, which would allow you to locate posts via keywords very quickly. You could store the index in the DB either as a Storable hash or as a table, though the latter might be costly in terms of space and time.
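That keyword index might look something like this sketch: an inverted hash of word => post ids, frozen with Storable so it could live in a BLOB (the sample posts are made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Storable qw( nfreeze thaw );

# Hypothetical posts keyed by id; real data would come from the DB.
my %posts = (
    1 => 'Examine what is said, not who speaks',
    2 => 'Love the truth but pardon error',
);

# Build the inverted index: word => { post_id => 1, ... }
my %index;
for my $id ( keys %posts ) {
    $index{ lc $_ }{ $id } = 1 for $posts{ $id } =~ /(\w+)/g;
}

# Serialise with Storable; the resulting string could go in a BLOB column.
my $frozen = nfreeze( \%index );

# Later: thaw it and look posts up by keyword, no SQL scan required.
my $idx  = thaw( $frozen );
my @hits = sort keys %{ $idx->{ truth } || {} };
print "posts containing 'truth': @hits\n";   # posts containing 'truth': 2
</imports_placeholder>
```

Updating the frozen hash on every new post means a read-modify-write cycle, which is where storing the index as a table instead starts to look attractive despite the space cost.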
And, once you have built an index of the words in a post, why store the words at all? If you fully invert your posts, you could probably get away with storing a 24-bit number to represent each word in a post, which would probably amount to an overall compression, as you would be discarding the spaces.
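The 24-bit encoding could be sketched like this (the dictionary and sample post are hypothetical; pack has no native 24-bit format, so a 32-bit pack with the high byte dropped stands in for one):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical dictionary assigning each distinct word a small integer id.
my %id_of;
my $next_id = 0;

sub word_id { my $w = lc shift; $id_of{ $w } //= $next_id++ }

# Replace a post's words with 3-byte (24-bit) ids; the spaces are discarded.
my $post   = 'Examine what is said not who speaks';
my $packed = join '',
    map { substr pack( 'N', word_id( $_ ) ), 1 }   # drop the high byte
    split ' ', $post;

printf "%d chars of text become %d bytes of ids\n",
    length( $post ), length( $packed );

# Recovering a word id is just the reverse: pad back to 32 bits and unpack.
my $first = unpack 'N', "\0" . substr $packed, 0, 3;
print "first word id: $first\n";
```

For short, common words the 3-byte id is a wash or a slight loss, but averaged over real prose (plus the discarded spaces) it should come out ahead, and 2^24 ids is far more than any forum's vocabulary.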
The same file/table/frozen hash that you use to relate index entries to words could also store the post numbers where the words appear. You now have the basis of a fairly well compressed, highly indexed datastore that allows very fast search and retrieval.
Just musing :)
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In language I do not grow strong; I grow old and fade away.
Rule 1 has a caveat! -- Who broke the cabal?