
Re^2: Efficient way to handle huge number of records?

by Anonymous Monk
on Dec 11, 2011 at 10:22 UTC ( [id://942921] )


in reply to Re: Efficient way to handle huge number of records?
in thread Efficient way to handle huge number of records?

Ok, first of all, thanks to you all for answering me. Let me explain my task thoroughly:
I have a large database of records as I wrote before, in the form of:
NAME1 SEQUENCE1 NAME2 SEQUENCE2
etc. The NAME is usually 50-60 chars (small) but the SEQUENCE can be from 100 chars to maybe 2-3000 chars as you also pointed out.
What my script does (or I want it to do) is:
The user supplies an unknown query input, then I run a program called BLAST which tells me that this unknown input had 1,2,3....50...60...200 hits in my database. So, I know the NAME of each hit. My problem is that each time I need to look up the database file, retrieve the respective entries and then create a new file (let's say an output file) which will have the NAME and SEQUENCE of all the hits so I can process them further.
Plus, my ultimate purpose is to create a webserver, so this database file will be accessed quite frequently; that's why I am asking whether a database or a simple local search through a hash (at least, that's what I can think of) is more recommended.
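
A minimal sketch of the "simple local search through a hash" idea described above (file names are hypothetical, and the parsing assumes one whitespace-separated NAME/SEQUENCE pair per line; adjust the split if the database is actually FASTA):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # build the lookup table once per run: NAME => SEQUENCE
    my %db;
    open my $in, '<', 'database.txt' or die "database.txt: $!";
    while (<$in>) {
        chomp;
        my ($name, $seq) = split /\s+/, $_, 2;
        $db{$name} = $seq if defined $seq;
    }
    close $in;

    # the NAMEs reported by BLAST (taken from the command line here)
    my @hits = @ARGV;

    # write NAME and SEQUENCE of every hit to the output file
    open my $out, '>', 'hits.out' or die "hits.out: $!";
    for my $name (@hits) {
        print {$out} "$name\t$db{$name}\n" if exists $db{$name};
    }
    close $out;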

Replies are listed 'Best First'.
Re^3: Efficient way to handle huge number of records?
by wfsp (Abbot) on Dec 11, 2011 at 13:34 UTC
    An alternative to building a hash in memory would be to have a disc based hash like DBM::Deep. Size is limited only by the size of your disc; it is fast, and it is particularly suited to “write seldom, read many”, which appears to be your case.
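
    A minimal sketch of that idea, with hypothetical file and key names: build the file once, then reopen it from the search script.

        use strict;
        use warnings;
        use DBM::Deep;

        # one-time build ("write seldom"): copy NAME => SEQUENCE pairs
        # parsed from the flat file into the disc based hash
        my $db = DBM::Deep->new( 'sequences.db' );
        $db->{'NAME1'} = 'SEQUENCE1';    # one assignment per record

        # later, in the search script ("read many"): no load time,
        # no large in-memory hash
        my $lookup = DBM::Deep->new( 'sequences.db' );
        print $lookup->{'NAME1'}, "\n";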

    However, it seems from your description that the “unknown query” would parse/search on the value of each key (“search through a hash”), i.e. look at each value in turn for each query. I always think of a hash as a “lookup table”, and it doesn't look like this is what you'll be doing. It may be that an array would be more suitable, with less overhead at least. Something like NAME1|SEQUENCE1. According to the docs, D::D can build disc based arrays too, although I have never tried it.
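
    A minimal sketch of that array-of-records idea (the names are hypothetical; in practice the array would be built from the database file):

        use strict;
        use warnings;

        # one "NAME|SEQUENCE" string per element
        my @records = ( 'NAME1|SEQUENCE1', 'NAME2|SEQUENCE2' );

        # a query has to walk the array and look at every record
        my $wanted = 'NAME2';
        for my $rec (@records) {
            my ($name, $seq) = split /\|/, $rec, 2;
            if ( $name eq $wanted ) {
                print "$name => $seq\n";
                last;
            }
        }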

    How unknown are your unknowns? Could there be general subsets/groups that could be preprocessed? You say there is additional processing to be done on the results; could that be done upfront? Could a series of additional lookups identify a smaller group of records to search (divide and conquer)? If you did choose the RDBMS route, are there indices that could be built that would help find what you're after? If there aren't, it would, imo, likely negate any advantage.

    If you have to process each record for each search then whatever method you adopt is going to be slow and possibly not suitable for a website.

    If you could come up with an outline of any known unknowns, there is a good chance the monks could give you some pointers on how to avoid what seems like a brute force approach. It would be a challenge many of them would be unable to resist. :-)

      I think you are reading too much into his use of the word "search", but you have a point there. If he ever wants to do lots of searches for parts of a key (i.e. "contains" instead of "equals"), a hash will be as bad as a simple array.
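
      A tiny sketch of that difference, with toy data and hypothetical names:

          use strict;
          use warnings;

          my %db = ( NAME1 => 'SEQUENCE1', NAME2 => 'SEQUENCE2' );

          # "equals": a single hash lookup
          print "equals:   $db{'NAME1'}\n";

          # "contains": the hash index doesn't help, every key is visited,
          # just like scanning a simple array
          my @matches = grep { /AME2/ } keys %db;
          print "contains: @matches\n";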

Re^3: Efficient way to handle huge number of records?
by jethro (Monsignor) on Dec 11, 2011 at 14:21 UTC

    Wait, is the program called BLAST something you already have, or is that the script you want to create? I.e. why would you use BLAST to search and then use a different script for retrieval? Obviously, when the retrieval script creates that output file, the number of hits is available with zero effort.

    If you want to use your script with a webserver you probably will observe the following with the different solutions:

    A) perl CGI script, loading hash in memory: Really bad, every search has to load the database into memory. Even if that takes only 4 seconds, it means major disk wear and is slow. Same with Storable.

    B) Using mod_perl, loading hash into memory: With a trick (using a BEGIN block to load the hash; see the sketch after option D) you could read the data only once into memory. Big improvement in execution speed, but memory could be a problem, as apache usually creates multiple instances of the web server, each running the script and holding the data. Depends on the server engine you use; a threaded engine probably could share the data but not a forking engine (Me no expert here, someone hopefully will correct me if I'm wrong). Same with Storable.

    C) Using SQLite, DBM::Deep, ...: Disk based methods. Data is not loaded into memory; an index keeps retrieval reasonably fast. Obviously somewhat slower than a simple hash in memory, but still simple to program. No problem if your database grows in size (within limits), no noticeable startup time. If the database does mostly reads and seldom writes, arguably the best solution in your case IMHO (see the SQLite sketch after option D).

    D) Using mysql or postgresql: Apart from the web server you have to install and manage another server (easy on linux (easy for me, not at all easy for my mother ;-) ), not so easy on windows(?)). Reading and changing data in the database never is a performance problem irrespective of size. Complex databases possible (which you don't seem to need), the database server will get bored to death in your case ;-). Probably a notch slower than solution C, but the only solution if you want 100 simultaneous apache threads reading and writing to that database.
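
    A minimal sketch of the load-once trick from option B (file name hypothetical; the hash is assumed to have been written with Storable's store() beforehand, and the package variable is what lets the data survive between requests in a mod_perl child):

        use strict;
        use warnings;
        use Storable qw(retrieve);

        # the BEGIN block fills the package variable once, when the
        # script is first compiled by mod_perl, not on every request
        our $db;
        BEGIN {
            $db = retrieve('sequences.stor');
        }

        # per-request part: in real life @hits would come from the BLAST report
        my @hits = @ARGV;
        print "$_\t$db->{$_}\n" for grep { exists $db->{$_} } @hits;

    And a sketch of option C with SQLite through DBI (file, table and column names are hypothetical; the table is assumed to have been loaded once with name as PRIMARY KEY, which provides the index that keeps single-name lookups fast):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=sequences.sqlite',
                                '', '', { RaiseError => 1 } );

        my @hits = @ARGV;    # the NAMEs reported by BLAST
        my $sth  = $dbh->prepare('SELECT sequence FROM seqs WHERE name = ?');

        open my $out, '>', 'hits.out' or die "hits.out: $!";
        for my $name (@hits) {
            $sth->execute($name);
            my ($seq) = $sth->fetchrow_array;
            print {$out} "$name\t$seq\n" if defined $seq;
        }
        close $out;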
