
I can store probably around 33 million files without seriously affecting lookup time (256 * 256 * 512 - I'm allowing a 50% margin), but backing up or transferring these files would be a major pain

Gee, I don't know, it would seem that making incremental backups would be easier than with a normal database.
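To show why I think so: with one file per post, an incremental backup only has to pick up the files changed since the last run. A minimal sketch in Python, assuming the posts live under a single root directory and that you record the time of the previous backup yourself (the name `changed_since` and both parameters are made up for illustration):

```python
import os

def changed_since(root, last_backup_time):
    """Yield paths under root modified after last_backup_time (epoch seconds)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Compare the file's modification time with the previous backup.
            if os.path.getmtime(path) > last_backup_time:
                yield path
```

Feed the resulting list to whatever archiver you like. Walking tens of millions of files is still slow, so this is only a sketch of the idea, not a tuned backup tool.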

But what about database searches? Seems to me that there I'd also have to keep a merged copy of the text in a processed format (punctuation and extra spaces removed), and just searching the text as it resides in the database won't be much help.

I doubt if it'd help much. grep is a very efficient way to search inside a file, and its regex syntax seems flexible enough to ignore punctuation and whitespace.
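As an illustration of the regex idea, here is a rough sketch that uses Python's `re` module as a stand-in for grep. It turns a phrase into a pattern that tolerates any run of punctuation or whitespace between the words, so the raw text can be searched without a processed copy. The helper name `phrase_pattern` is my own invention:

```python
import re

def phrase_pattern(phrase):
    """Compile a regex that matches the words of phrase in order,
    separated by one or more punctuation or whitespace characters."""
    words = re.findall(r"\w+", phrase)
    # [\W_]+ matches any run of non-word characters (and underscores).
    body = r"[\W_]+".join(re.escape(word) for word in words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

pattern = phrase_pattern("hello world")
print(bool(pattern.search("Well, hello...   world!")))
```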

But you definitely should be thinking about indexed searching, where you build an inverted index: for each important keyword, a list of every post in which it can be found.
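The kind of index described above (commonly called an inverted index) can be sketched in a few lines. This Python version assumes the posts fit in a dict mapping post id to text, indexes every word rather than only the "important" ones, and keeps everything in memory. The names `build_index` and `search` are hypothetical, and a real system would add stop-word filtering and on-disk storage:

```python
import re
from collections import defaultdict

def build_index(posts):
    """Map each lowercased word to the set of post ids containing it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(post_id)
    return index

def search(index, query):
    """Return the ids of posts that contain every word in query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    # Start from the first word's posts and intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

A lookup then touches only the posting lists for the query words instead of scanning every post.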

Will storing the posts in the database instead of as individual files take up a lot more space?

As a general rule, databases are wasteful with space. They all seem to be designed as if disk space were free: if your disk isn't large enough to hold the data, you can just throw more hard disks at it. So no, I don't expect databases to be space efficient, or even to reclaim wasted disk space by themselves. Whatever you do, your home-grown solution will likely be a lot more compact.

But be prepared to put a lot of effort into it. Building a decent database-like system from scratch is a lot of hard work, requiring a lot of nifty home-grown solutions, many of which simply come as part of a standard database system.

In reply to Re: Large chunks of text - database or filesystem? by bart
in thread Large chunks of text - database or filesystem? by TedPride
