Efficient way to handle huge number of records?

by Anonymous Monk
on Dec 11, 2011 at 08:35 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!
I have a large text file, which consists of ~11,000,000+ records in the form of:
>NAME1
SEQUENCE1
>NAME2
SEQUENCE2
>NAME3
SEQUENCE3

I then have my Perl script, which reads some results and then needs to extract only certain records from this huge text file. It could be 100 records, 10 records, 45 records, etc., depending on my results.
My question is, since I suppose it would be rather time-consuming to read the whole file each time searching for the desired records:
(I) Can Perl handle a hash of that size, i.e. 11,000,000 keys and values? In that case, as I understand it, the script would have to create the hash each time, so it would be rather time-consuming, no?
(II) Would it be more efficient to store these records in a MySQL database and then retrieve them based on the NAME, for example? Can the database store that many items without trouble? I suppose yes?
Thank you!

Replies are listed 'Best First'.
Re: Efficient way to handle huge number of records?
by BrowserUk (Patriarch) on Dec 11, 2011 at 10:05 UTC
    1. Can Perl handle a hash of that size?

      Yes. If your system has sufficient memory.

      For a 32-bit Perl, you're limited (in most cases) to 2 GB of RAM (regardless of how much memory the machine has installed), which is probably not enough for the size of file you've outlined. A hash holding 11 million pairs with 40-char keys and 80-char values requires ~2.4 GB.

      If you have a 64-bit OS and a 64-bit perl installed, then your memory is limited by RAM (plus swap space, but don't go there). A reasonably modern commodity box with 4GB of RAM and a 64-bit OS & Perl would easily handle your 11 million records, assuming your description means that each description and sequence is limited to ~80 characters each.

      However, the file format you allude to (FASTA) frequently has multiple lines of sequence for each ID, and their sizes can individually be many 1000s or even millions of codons. So whether your machine can handle the dataset you need to work with depends upon not just the number of the records, but also their sizes.

      As a rough rule of thumb, anticipate 100MB per 1million records + the size of the file on disk in MBs.

      Eg. 11,000,000 records with 40 chars IDs and 80 char sequences = 11 * 100MB + 11e6 * 120 = 1.1GB + 1.2GB = 2.3GB total memory requirement.

    2. In that case, as I can understand, the script would have to create the hash each time, so it would be rather time-consuming, no?

      Not as long as you might think.

      This creates a 1-million-key FASTA file and then reads it back to build the hash:

      C:\test>perl -E"sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] }; say('>',rndStr( 40, 'a'..'z')),say(rndStr(80,qw[a c g t])) for 1 .. 1e6" >junk.dat

      C:\test>prompt [$t] $p$g

      [ 9:41:54.22] C:\test>perl -E"$h{<>}=<> until eof(); print scalar keys %h" junk.dat
      1000000

      [ 9:41:57.88] C:\test>

      Under 4 seconds to load up the hash. So for 11 million keys, it will take less than a minute to load up the file. Assuming you have enough free memory to hold it.

      It will take longer if your sequences are substantially larger (a loader that copes with multi-line sequences is sketched below, after point 4).

    3. would it be more efficient to store these records in a Mysql database and then retrieve them based on the NAME for example?

      That really depends a lot upon how many times you are going to use the data.

      Remember that when querying data from a DB, it still has to be loaded from disk. So do the indexes. And for an RDBMS like MySQL, the data then has to be serialised through a socket or pipe.

      For a one-time, or just a few, uses of the dataset, the time taken to set up the DB is pure overhead which can negate any gains. If you have to download and install the DB first, and then learn how to set it up and use it, a flat file and a hash win hands down.

      If, however, you have the DB, know how to use it and are going to be accessing the same dataset many times over an extended period, then the equation may swing the other way.

    4. Can the database store that many items without trouble?

      Any DB that couldn't handle that few records would not be worthy of the name. Even MySQL or SQLite should easily handle low billions of records without trouble.
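
      As an aside on point 2: the loader above assumes each record is exactly one header line followed by one sequence line. If your FASTA has multi-line sequences (the caveat in point 1), you need something slightly more defensive. A minimal sketch (the file name is a placeholder, untested against real data):

      use strict;
      use warnings;

      my %seq;
      open my $fh, '<', 'junk.dat' or die "Cannot open: $!";
      {
          local $/ = "\n>";                  # read whole FASTA records at a time
          while ( my $rec = <$fh> ) {
              chomp $rec;                    # strip the trailing "\n>"
              $rec =~ s/\A>//;               # strip the leading ">" (first record only)
              my ( $header, @lines ) = split /\n/, $rec;
              my ($name) = split /\s+/, $header;   # keep just the ID part
              $seq{$name} = join '', @lines;
          }
      }
      printf "%d records loaded\n", scalar keys %seq;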



      BrowserUk,

      Just to comment on the 32/64-bit point (re Linux): a 32-bit Linux system can support 64GB of memory, but each process is limited to 4GB (usually 3.7GB). So each Perl application would have approximately 3.7GB to handle script and data.

      In the Windows world, you are absolutely correct.

      Thank you

      "Well done is better than well said." - Benjamin Franklin

        Hm. You're stretching several boundaries beyond their limits there:

        The 32bit memory mapping system supports either 4GB of address space without PAE or 64GB with PAE. But that does not necessarily tell you how much Linux supports with/without PAE.

        Linux also introduces constraints on total physical memory based on interactions with the way it manages kernel virtual memory. That leads to at least four different levels of memory support based on choices made during kernel build.

        The lowest level is 896MB without PAE
        The next level is about 3.25GB (bios limited) without PAE
        The next level is, I think, about 16GB, with PAE

        The highest level, I think, is the full 64GB with PAE plus an ugly kludge in kernel virtual memory (I think a bad idea. Use 64bit instead for that much ram).

        Win32 can also do (and I believe was first to do) Physical Address Extension (PAE). It can also extend the default 2GB user space to 3GB per process. But just like Linux, these limits are extended through a series of kludges that have drawbacks as well as benefits.



      Any DB that couldn't handle that few records would not be worthy of the name. Even MySQL or SQLite should easily handle low billions of records without trouble.

      I would be quite interested to see SQLite do this. (may even try it myself...)

      In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow (even on fast hardware), that I never bothered with further use.

      I'd love to hear that this has improved - SQLite is nice, when it works. Does anyone have recent datapoints?

      (As far as I am concerned, MySQL and BerkeleyDB, as Oracle products, are not a serious option anymore (I am convinced Oracle will keep making things worse for non-paying users), but I am interested to know how their performance (or Oracle's itself, for that matter) compares to PostgreSQL.)

        That is not true. SQLite is the fastest DB engine I have ever come across. You just need to increase the cache/read-buffer size to, say, 4 MB and do the import within a transaction, and you will see that it can import the above values with no problem in under a minute, where MySQL will take much longer. And since it can keep everything in RAM, query time is again going to be much, much faster. So if you or anyone else is looking for a fast DB engine without some fancy-schmancy features that are rarely used anyway, SQLite is the way to go.

        So if you need a DB engine that is fast and reliable and can deal with lots of data, you will want SQLite.

        Now, as far as the initial question goes, you can do something similar to what MySQL does. You could split the file into chunks and index the chunks by line number, so that you know on which line the header of your sequence appears. Once you have done that, you only need to hash those indexes. This reduces the search time in proportion to the number of fragments you end up with after splitting the initial file.

        In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow

        I said "handle" not "handle well" :)

        That said, I had SQLite on my old machine and found that .import file table via sqlite3.exe was substantially faster than doing inserts via SQL, whether from the command-line utility or via Perl & DBI.

        I wish I could get a 64-bit build for my system.



        erix,

        Sorry to take so long, real work got in the way.

        Here's the updated info. I used BrowserUk's sub to generate the data. The key part was 40 bytes long, but the data part came out at 320 bytes, so I 'substr'ed it to 80 (if 320 is correct, I can run the tests again). I ran the test for 1_000_000 records, since for your purposes it didn't matter. Also, I multiplied the times by 1000 to get the results in milliseconds. I generated the random keys at the beginning, so that they wouldn't be in cache.

        Here it is:

        while ( $cnt < $howmany ) {
            $key  = rndStr( 40, 'a'..'z' );
            $data = substr( rndStr( 80, qw[a c g t] ), 0, 80 );
            if ( ( ( $cnt % 113 ) == 0 ) && ( scalar keys %khash < 10 ) ) {
                $khash{$key} = 0;
            }
            .
            .
            .

        for ( 1 .. 4 ) {
            foreach $key ( keys %khash ) {
                $stime = gettimeofday;
                $ret   = $cursor->c_get( $key, $data, DB_SET );
                $etime = sprintf( "%.6f", ( gettimeofday - $stime ) * 1_000 );
                print " $key Time: $etime ms\t$hkey\n";
            }
        }

        Running it, and the output:

        # time perl Show11M_mod.plx
        cds_enabled
        ## Start: VSZ-10292_KB RSS-4828_KB    BLOCK: 512 ( 1000000 )
        Write: 1049.66578292847 952/sec 1000000
        ReadNext: 28.9542100429535 34537/sec Total: 1000000
        ## End:   VSZ-10292_KB RSS-6284_KB Diff:0|1456_KB    BLOCK: 512
        rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.164032 ms
        evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.089884 ms
        qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.064135 ms
        pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.066996 ms
        wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.050068 ms
        ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.043869 ms
        dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.052929 ms
        xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.050068 ms
        qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.053167 ms
        rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.057936 ms
        rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.012875 ms
        evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.011921 ms
        qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.010967 ms
        pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.010967 ms
        wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.010967 ms
        ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.011206 ms
        dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.010967 ms
        xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.010967 ms
        qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.012159 ms
        rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.010967 ms
        rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.011921 ms
        evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.012159 ms
        qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.012159 ms
        pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.010967 ms
        wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.010014 ms
        ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.010967 ms
        dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.010014 ms
        xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.010967 ms
        qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.010967 ms
        rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.010014 ms
        rijrxyzhfvfhvpktkiedvmnpwdphswhavejjwqvr Time: 0.011921 ms
        evxacpuyerimyidhwfqnvqsjqzrdpgwxzywssakk Time: 0.011921 ms
        qrckdiakaaanjsrnvsswzuebxmtxeaznhpwdqgfn Time: 0.010967 ms
        pxlyvhbaujsfdwzsdjterlqeiothhpdzljizypbi Time: 0.010967 ms
        wfbqhvgjnltboojbctaszbaxlcwibjdjgmwzcusu Time: 0.010967 ms
        ukotkvoceuchbrrdegkixjdegzqclfxbwkdvrnkj Time: 0.010967 ms
        dcrcpnxnuhfrwmysbxnfmbzqhgeblvoyczoqboef Time: 0.010967 ms
        xsgzxvlivfwqirwmpjpdnbtifuvjqmbthmgtnbxh Time: 0.010967 ms
        qntwonibxslleldmlvanodhzlqhweeihlsarfznj Time: 0.010967 ms
        rpflfufduuqvtkydqswvgnyionloswworrdraplt Time: 0.010967 ms

        real 18m17.387s
        user 1m52.459s
        sys  0m34.850s

        Regards...Ed

        "Well done is better than well said." - Benjamin Franklin

        In the past (last time I tried was, I think, a couple of years ago) SQLite always proved prohibitively slow: loading multimillion-row data was so ridiculously slow (even on fast hardware), that I never bothered with further use.

        I showed the framework of some code at Re^3: Efficient way to handle huge number of records? which is an abbreviated version of some code that I'm currently working on. One table has a million records of 50 fields. So I ran a couple of tests.

        First test was with all the speed-up stuff turned off:

        Starting: Fri Dec 16 13:56:56 2011
        Creating new Database!! - HD table
        records inserted: 1,117,526
        Ending: Sun Dec 18 02:15:30 2011
        Now I figure that qualifies as "ridiculously slow!". I actually had to run it twice because I got one of those "Windows Automatic Reboot was required" things! Bummer when that happens after one day of processing!

        Using the optimizations (and by FAR the biggest effect is to do a single transaction!) results in:

        Starting: Sun Dec 18 15:26:53 2011
        Creating new Database!! - HD table
        records inserted: 1,117,526
        Ending: Sun Dec 18 15:29:44 2011
        Or about 3 minutes instead of 2 days! A lot better! This is fast enough for my needs. Using the bulk import utility would probably be faster, but I haven't tested that because 3 minutes doesn't bother my application.
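
        To make the single-transaction point concrete, here is a minimal sketch (table, column and file names are invented); the only difference between the two loops is the explicit transaction around the second one:

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=demo.sqlite', '', '',
                                { RaiseError => 1, AutoCommit => 1 } );
        $dbh->do('CREATE TABLE IF NOT EXISTS hd (field1 TEXT)');

        my @rows = map { "row$_" } 1 .. 100_000;
        my $ins  = $dbh->prepare('INSERT INTO hd (field1) VALUES (?)');

        # Slow: every execute() is its own transaction, i.e. its own disk sync.
        $ins->execute($_) for @rows;

        # Fast: the same inserts wrapped in one explicit transaction.
        $dbh->begin_work;
        $ins->execute($_) for @rows;
        $dbh->commit;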

        I have another app that builds a 500K-record table from 1,000 input files. It takes about half the time, or about 90 seconds. It's not worth my programming effort to emit an intermediate file in whatever format the bulk import utility needs; I just put the data into the DB right away. A reasonable programming tradeoff. Mileage varies.

        It should be noted that my machine is an older one, a hyper-threaded one (before the multi-core days), the Prescott stepping (the last one with a PGA, pin grid array), and my disks are only 5K rpm (not 7K+). A more modern machine can run a single process at least 4x this fast, or about 45 seconds instead of 3 minutes (I've benchmarked my machine vs a friend's on similar tasks before).

        The time scales linearly, so 11M records would take 10x as long. Is that "ridiculously slow"? I don't know. I guess that depends upon the application.

        I do have a MySQL server running on my machine and in the past I've done some benchmarking vs SQLite. For a complicated query, MySQL is faster, but for my current projects, SQLite "wins" due to admin simplicity (none required!).

Re: Efficient way to handle huge number of records?
by Marshall (Canon) on Dec 11, 2011 at 09:02 UTC
    This looks like a Bio Perl problem. I would look at those Perl modules.

    11,000,000 records is pretty big! If you want to do a lot of searching, then an SQL DB is a good idea. I recommend SQLite, although MySQL is fine albeit with some admin hassles. SQLite has the advantage that the DB is just a single file, and all of the admin passwords and setting up of accounts is not necessary. However, 11 million records isn't all that big as a hash table.

    If you load this into an SQLite DB and then generate an index based upon NAME, a name-based query will run fast.

    What have you tried so far?

    Either MySQL or SQLite can handle this number of records easily.

      Actually, I would be rather cautious about using MySQL, given all the trade-offs and bugs that depend on which storage backend you use. These days, I really recommend switching to PostgreSQL.

      The license is more open (BSD instead of mixed GPL/Commercial), the speed is very much OK and the database is fully ACID compliant - which is getting rather important these days since sooner or later you'll probably run into the issue of porting your programs and modules to run on a multithreaded webserver (a.k.a "the cloud"). Making your stuff accessible through the Web seems to be the trend these days. Ask any manager what buzzwords he wants to use in his next speech to the shareholders and "cloud computing" and "HTML5" seem to top the list ;-)

      Ok, this was my personal point of view on this issue. Here is a (long) comparison about the two databases.

        For single user, single process, I figure that SQLite has it. And I recommend it for those situations. It is completely Open Source.

        I think that your point about PostgreSQL is a good one! But I don't think the OP needs PostgreSQL or MySQL.

        From what has been described, the SELECT functions will be only on the NAME and that is easy and SQLite will do the job.

      Yes, true, it's bioinformatics, but I don't need any modules; I have my data in the text file...
      I wanted to start using MySQL with Perl but I can't seem to be able to install the Mysql module. I tried CPAN, but the command:
      perl -MCPAN -e 'install Mysql'

      returns:

        OK. I recommend that you not mess with MySQL. There is more pain past just getting the DBI module. You will have to get a MySQL daemon started and there is admin hassle.

        I recommend that you start with SQLite. There is no daemon. There is no admin. It does 90% of what MySQL does for a single user and with a lot less hassle.

        I use the SQLite Manager plug-in for Firefox instead of the command-line interface, which I have never bothered to install on my machine. You will need to install the DBD::SQLite module and that's about it.

        Try installing DBD::SQLite and see how it goes.
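
        Once it is installed, a quick sanity check from the shell is just to print the module version (assuming a Unix-style shell for the quoting):

        perl -MDBD::SQLite -e 'print "DBD::SQLite $DBD::SQLite::VERSION\n"'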

        Update:
        I do have a MySQL server running right now on my Windows machine.
        I know how to do it and that is why I am advising you not to do it!
        Too much hassle!

Re: Efficient way to handle huge number of records?
by leuchuk (Novice) on Dec 11, 2011 at 10:10 UTC

    Answer to question (I):

    I don't think it's a problem for Perl to handle this size. The problem could be your computer (and maybe your patience).

    A hash needs, as far as I remember, a minimum of 55 bytes per hash key + the content. 11 million times 55 = 605 million bytes = 576 MByte just for the overhead... Do you have enough memory for the whole file + overhead + OS + ...? If your hardware doesn't have enough memory it will be swapping, and that can make your solution unacceptably slow.
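
    If you would rather measure than estimate, the CPAN module Devel::Size gives a quick idea of the real per-entry cost on your own perl; a sketch with an arbitrary 100,000-entry sample:

    use strict;
    use warnings;
    use Devel::Size qw(total_size);

    my %h;
    $h{ sprintf 'NAME%06d', $_ } = 'A' x 80 for 1 .. 100_000;

    my $bytes = total_size( \%h );
    printf "%d entries, ~%.0f bytes each including overhead\n",
        scalar keys %h, $bytes / keys %h;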

    Question (II): I assume that you are no database expert, are you? 11 million records is no problem at all (depending on your table definition), but your data may be: for example, if your sequence only makes sense in a certain context, so that NAME2 is to be interpreted differently depending on the record NAME1 before and/or the record NAME3 after. But if that were the case, a hash wouldn't be the right solution either.

    Just getting these data into your database may be only half of the solution. The full answer depends on how you find those 100, 10 or 45 records. Your reading algorithm should be good enough to read the database in a rational way, meaning the Perl script doesn't read the records a zillion times... A database is no solution either if you just read the table sequentially (a "SELECT * from table;"), compute your results somehow and then pick your 100, 10 or 45 records. The overhead of storing your data in the database and retrieving it could kill all the gains from using a database.

    Marshall proposed SQLite. This is really a good solution, especially if you work with a limited number of processes accessing the database at the same time. Personally I wouldn't take MySQL, I'd prefer PostgreSQL, but this is just my opinion.
      OK, first of all thanks to you all for answering me. Let me explain my task thoroughly:
      I have a large database of records as I wrote before, in the form of:
      NAME1
      SEQUENCE1
      NAME2
      SEQUENCE2
      etc. The NAME is usually 50-60 chars (small) but the SEQUENCE can be from 100 chars to maybe 2-3000 chars as you also pointed out.
      What my script does (or I want it to do) is:
      The user supplies an unknown query input, then I run a program called BLAST which tells me that this unknown input had 1, 2, 3... 50... 60... 200 hits in my database. So, I know the NAME of each hit. My problem is that each time I need to look up the database file, retrieve the respective entries and then create a new file (let's say an output file) which will have the NAME and SEQUENCE of every hit, to process it further.
      Plus, my ultimate purpose is to create a webserver, so this database file will be accessed quite frequently; that's why I am asking whether a database or a simple local search through a hash (at least that's what I can think of) is more recommended.
        An alternative to building a hash in memory would be to use a disc-based hash like DBM::Deep. Size is limited only by the size of your disc, it is fast, and it is particularly suited to “write seldom, read many”, which appears to be your case.
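
        A minimal sketch of what that looks like with DBM::Deep (the file name and keys are placeholders):

        use strict;
        use warnings;
        use DBM::Deep;

        # The hash lives on disk; only the parts you touch are paged in.
        my $db = DBM::Deep->new( 'sequences.db' );

        # One-off (slow-ish) load, e.g. while reading the FASTA file:
        $db->{NAME1} = 'SEQUENCE1';

        # Later, from any script, lookups are cheap and need no reload:
        print $db->{NAME1}, "\n" if exists $db->{NAME1};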

        However, it seems from your description that the “unknown query” would parse/search on the value of each key (“search through a hash”), i.e. look at each value in turn for each query. I always think of a hash as a “lookup table”, and it doesn't look like this is what you'll be doing. It may be that an array would be more suitable, with less overhead at least: something like NAME1|SEQUENCE1. According to the docs, D::D can build disc-based arrays too, although I have never tried.

        How unknown are your unknowns? Could there be general subsets/groups that could be preprocessed; you say there is additional processing to be done on the results – could that be done upfront? A series of additional lookups that could identify a smaller group of records to search – divide and conquer? If you did choose the RDBMS route, are there indices that could be built that would help find what you're after? If there aren't, it would, IMO, likely negate any advantage.

        If you have to process each record for each search then whatever method you adopt is going to be slow and possibly not suitable for a website.

        If you could come up with an outline of any known unknowns, there is a good chance the monks could give you some pointers on how to avoid what seems like a brute-force approach. It would be a challenge many of them would be unable to resist. :-)

        Wait, is the program called BLAST something you already have, or is that the script you want to create? I.e. why would you use BLAST to search and then use a different script for retrieval? Obviously, when the retrieval script creates that output file, the number of hits is available with zero effort.

        If you want to use your script with a webserver you probably will observe the following with the different solutions:

        A) Perl CGI script, loading the hash into memory: really bad, every search has to load the database into memory. Even if that takes only 4 seconds, it is major disk wear and slow. Same with Storable.

        B) Using mod_perl, loading the hash into memory: with a trick (using a BEGIN block to load the hash, sketched after option D below) you could read the data into memory only once. Big improvement in execution speed, but memory could be a problem, as Apache usually creates multiple instances of the web server, each running the script and holding the data. It depends on the server engine you use; a threaded engine probably could share the data, but not a forking engine (me no expert here, someone hopefully will correct me if I'm wrong). Same with Storable.

        C) Using SQLite, DBM::Deep, ...: disk-based methods. Data is not loaded into memory; an index keeps retrieval reasonably fast. Obviously somewhat slower than a simple hash in memory, but still simple to program. No problem if your database grows in size (within limits), and no noticeable startup time. If the database mostly does reads and seldom writes, arguably the best solution in your case IMHO.

        D) Using MySQL or PostgreSQL: apart from the web server, you have to install and manage another server (easy on Linux (easy for me, not at all easy for my mother ;-) ), not so easy on Windows(?)). Reading and changing data in the database is never a performance problem, irrespective of size. Complex databases are possible (which you don't seem to need); the database server will get bored to death in your case ;-). Probably a notch slower than solution C, but the only solution if you want 100 simultaneous Apache threads reading from and writing to that database.
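
        To illustrate option B, here is a rough sketch of a ModPerl::Registry-style CGI script that loads the hash in a BEGIN block, so the load happens once per interpreter rather than once per request (the file path is a placeholder, and the parsing assumes the simple one-header-line/one-sequence-line layout):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use CGI;

        our %SEQ;                 # persists for the life of the interpreter

        BEGIN {                   # runs at compile time, i.e. once per interpreter
            open my $fh, '<', '/data/proteins.fa' or die "Cannot open FASTA: $!";
            while ( my $name = <$fh> ) {
                my $seq = <$fh>;
                last unless defined $seq;
                chomp( $name, $seq );
                $name =~ s/\A>//;
                $SEQ{$name} = $seq;
            }
            close $fh;
        }

        my $q    = CGI->new;
        my $name = $q->param('name') // '';
        print $q->header('text/plain');
        print exists $SEQ{$name} ? ">$name\n$SEQ{$name}\n" : "not found\n";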

Re: Efficient way to handle huge number of records?
by educated_foo (Vicar) on Dec 11, 2011 at 16:00 UTC
    If you're only doing exact matches on the sequence tags, a database is overkill, and will probably be slower than an in-memory hash. If a hash mapping tags to sequences is too big, you can save quite a bit of space by (once) mapping tags to file offsets in your original file (untested):
    use Storable;

    open IN, "input.fa";
    my %id2off;
    while (<IN>) {
        next unless /^>/;
        chomp;
        $id2off{substr($_, 1)} = tell(IN);   # file pointer points to start
                                             # of next line
    }
    store \%id2off, 'input.idx';
    Then later (many times) using this index to retrieve the sequence data (also untested):
    use Storable;

    $id2off = retrieve 'input.idx';
    open IN, 'input.fa';
    while (<>) {
        chomp;
        next unless exists $id2off->{$_};
        print ">$_\n";
        seek(IN, $id2off->{$_}, 0);   # move file pointer to start of sequence
        while (<IN>) {                # print until we reach next sequence
            last if /^>/;
            print;
        }
    }
    Sorting your data won't really work so well since, with variable-length records, you have to do a linear scan of the file to find the linebreaks.
      See Bio::DB::Fasta, which already implements this.
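
      For reference, its use looks roughly like this (it builds and caches the offset index for you on first use; 'input.fa' and 'NAME1' are placeholders):

      use Bio::DB::Fasta;

      my $db  = Bio::DB::Fasta->new('input.fa');   # indexes the file on first use
      my $seq = $db->seq('NAME1');                 # whole sequence for that ID
      print ">NAME1\n$seq\n";
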
Re: Efficient way to handle huge number of records?
by erix (Prior) on Dec 11, 2011 at 16:39 UTC

    On a slow desktop (which is also busy doing other database stuff), I get the below performance with PostgreSQL (9.2devel git/HEAD).

    A 10,000,000-sequence FASTA file was made (with BrowserUk's script, adapted - thanks). This yielded a 1.2G junk.dat file:

    -rw-rw-r-- 1 aardvark aardvark 1.2G Dec 11 15:13 junk.dat

    grep -c '^>' junk.dat
    10000000

    # one-off: create database table 'junk':
    time < junk.dat psql -c "
      drop table if exists junk;
      create table junk(accession text, sequence text);
      copy junk from stdin csv delimiter E'\t';" ;

    real  1m35.215s

    # one-off: create index on database table 'junk':
    time echo "
      create index junk_idx on junk (accession);
      analyze junk;" | psql

    real  7m58.633s

    Now the interesting part: retrieval (I don't know why others are not posting retrieval results?).

    I retrieved 10 random rows, and did that 4 times in succession, to show caching effect.

    name protein (or dna?)                          retrieval time
    >ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh Time: 160.624 ms
    >braniambsateowfjdmbgsmflbsosllhzureeponx Time: 124.207 ms
    >aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc Time: 46.639 ms
    >sadwirrivjzbezsrgcxpeavketadfvszwamxzwer Time: 74.509 ms
    >pqrjkynzzeromipzagvunnudrygxbdvokghnkmes Time: 83.611 ms
    >mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc Time: 69.637 ms
    >dwiepreektknecinzqzbyjwsxcjblszxqhflhivd Time: 37.553 ms
    >ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk Time: 94.236 ms
    >thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy Time: 56.977 ms
    >qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu Time: 50.896 ms
    >ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh Time: 3.799 ms
    >braniambsateowfjdmbgsmflbsosllhzureeponx Time: 2.029 ms
    >aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc Time: 1.943 ms
    >sadwirrivjzbezsrgcxpeavketadfvszwamxzwer Time: 1.883 ms
    >pqrjkynzzeromipzagvunnudrygxbdvokghnkmes Time: 2.046 ms
    >mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc Time: 2.529 ms
    >dwiepreektknecinzqzbyjwsxcjblszxqhflhivd Time: 1.863 ms
    >ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk Time: 1.952 ms
    >thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy Time: 2.042 ms
    >qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu Time: 2.011 ms
    >ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh Time: 2.632 ms
    >braniambsateowfjdmbgsmflbsosllhzureeponx Time: 1.992 ms
    >aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc Time: 1.889 ms
    >sadwirrivjzbezsrgcxpeavketadfvszwamxzwer Time: 5.196 ms
    >pqrjkynzzeromipzagvunnudrygxbdvokghnkmes Time: 2.091 ms
    >mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc Time: 2.177 ms
    >dwiepreektknecinzqzbyjwsxcjblszxqhflhivd Time: 1.877 ms
    >ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk Time: 2.092 ms
    >thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy Time: 12.836 ms
    >qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu Time: 1.867 ms
    >ewwhsefmyqqadboyjannenpqvtntiqjjheuwdvuh Time: 1.965 ms
    >braniambsateowfjdmbgsmflbsosllhzureeponx Time: 1.929 ms
    >aulshznwumwcbqcgamrgpgysmkbkykjmdypqgglc Time: 1.857 ms
    >sadwirrivjzbezsrgcxpeavketadfvszwamxzwer Time: 1.999 ms
    >pqrjkynzzeromipzagvunnudrygxbdvokghnkmes Time: 2.009 ms
    >mfjqtvvvoibwubqyrshwtybpdyexxsrsyndyjbgc Time: 1.981 ms
    >dwiepreektknecinzqzbyjwsxcjblszxqhflhivd Time: 1.865 ms
    >ifuwdcrwwofedjeehdheryzcabhkpfnbnfqcddkk Time: 1.880 ms
    >thfhatnyothgvtmwwdsflyukanykawezrlvnkgyy Time: 1.878 ms
    >qspsniccydlcrbcinkmbtoryktkuysvhkpziamnu Time: 2.158 ms

    That's not so bad, is it? (And remember, this desktop (2GB, AMD Athlon(tm) II 160u, 800MHz -- yes, it's slow) is in the background doing searches (indexed, but also full table scans) of uniprot, which, as you perhaps know, has more than 18M records.)

    (If you cannot get MySQL installed, all the above may also be a bit over your head; nevertheless, you should consider learning database work -- it will often come in handy in bioinformatics. And since I've now done the work, I'll post it, for what it's worth ;) )

    (I may update the timings when the machine becomes otherwise idle)
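
    For completeness, the Perl side of such a lookup is just a prepared statement against the indexed column; a sketch (connection parameters are placeholders, the table is the 'junk' table created above):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:Pg:dbname=test', '', '', { RaiseError => 1 } );
    my $sth = $dbh->prepare('SELECT sequence FROM junk WHERE accession = ?');

    for my $acc (@ARGV) {
        $sth->execute($acc);
        while ( my ($seq) = $sth->fetchrow_array ) {
            print "$acc\n$seq\n";
        }
    }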

      Only 5 million records because I have an image analysis that has been running for 2 days now using over half my memory and I don't want to interrupt it.

      I retrieved the first 1000 records from the hash -- as near random as makes no difference -- because 10 wasn't representative. Retrieval time is 0.4 millisecs per.

      The code:

      #! perl -slw
      use strict;
      use Time::HiRes qw[ time ];

      our $N //= 1000;

      my $start = time;
      my %dna; keys %dna = 5e6;
      $dna{ <> } = <> until eof();
      printf "Loading %d records took %d seconds\n",
          scalar keys %dna, time() - $start;

      $start = time();
      my( $id, $seq ) = each %dna for 1 .. $N;
      printf "Took %.6f seconds per retrieval\n", ( time() - $start ) / $N;

      __END__
      [22:14:18.06] C:\test>junk48 -N=1e3 junk.dat
      Loading 5000000 records took 37 seconds
      Took 0.000414 seconds per retrieval


      erix

      while ( $cnt < $howmany ) {
          my $st  = $cnt % 58;
          my $chr = chr( 122-$st );
          $key = "$chr-$cnt";

      If you notice, in the code I commented out 2 lines and changed the 3rd to "A-$cnt", since the poster didn't care about randomness. I do, since that affects some DBs. If you want, I can run it so that after 30 seconds I select a record, and then after completion I'll read it back once, and then 10 times, with timing.

      Is that what you want?

      The system is a 2-core 2GB AMD, 2.5 or 2.8 GHz. When I get you the times, I'll find out exactly what it is, but for comparison I only run on 1 core.

      Thank you

      "Well done is better than well said." - Benjamin Franklin

Re: Efficient way to handle huge number of records?
by mrguy123 (Hermit) on Dec 11, 2011 at 11:41 UTC
    Hi there,

    If the text file is the same each time, a good option that is simpler than using SQL is Storable.

    This way, you can load the data into a hash once and then store it as a serialized file, which can be retrieved fairly quickly. I also had a similar bioinformatics problem which was solved this way.
    Good Luck!
    MrGuy
      Yes, that might actually be the solution to my problem!
      I have always the same kind of database file, which is:
      >protein1
      ASKRKGRKGTTRHTRHTRHTRHTHRHTR
      >protein2
      REHGTRYUYTJUYJYTJUJUJUJYJUYJUY
      >protein3
      EWFDEWFEFREFRE

      Apparently I'll create a "big" hash with protein1, protein2, protein3, etc. as keys and ASKRKGRKGTTRHTRHTRHTRHTHRHTR, REHGTRYUYTJUYJYTJUJUJUJYJUYJUY, EWFDEWFEFREFRE, etc. as values.
      I am a bit confused about the usage of this module, though I browsed a bit in the respective manual page... Once you create the hash, how do you call it? Is it retrieved each time the script is called?
      If you can, please supply a brief example of usage!
      Thanks a bunch!
        Creating an SQLite DB is not that hard. I haven't tested this code, but this is at least the general idea.
        #!/usr/bin/perl -w
        use strict;
        use DBI;

        my $dbfile = "./DNA.sqlite";
        print "Creating new DNA Database\n";
        if (-e $dbfile) { unlink $dbfile or die "Delete of $dbfile failed! $!\n"; }

        my $dbh = DBI->connect("dbi:SQLite:dbname=$dbfile", "", "", {RaiseError => 1})
            or die "Couldn't connect to database: " . DBI->errstr;

        $dbh->do("CREATE TABLE dna
                  (
                    id       integer PRIMARY KEY AUTOINCREMENT,
                    protein  varchar(10)   DEFAULT '',
                    sequence varchar(1000) DEFAULT ''
                  );
                 ");
        $dbh->do('PRAGMA synchronous = 0');     # Non transaction safe!!!
        $dbh->do('PRAGMA cache_size = 200000'); # 200 MB dynamic cache increase
                                                # makes index creation faster
        $dbh->do("BEGIN");
        import_data();
        $dbh->do("COMMIT");

        $dbh->do("BEGIN");
        $dbh->do("CREATE INDEX iprotein_idx ON dna (protein)");
        $dbh->do("COMMIT");

        sub import_data
        {
            my $add = $dbh->prepare("INSERT INTO dna ( protein, sequence) VALUES(?,?)");

            # ...your loop to read the data goes here:
            # foreach protein and sequence...
            # {
            #     $add->execute($protein, $sequence);
            # }
        }
        Update:
        Basically when creating the DB, you want to turn all the ACID stuff off.
        That means: don't wait for writes to complete, don't do complete transactions for each "add", run the cache size up from default of 20MB to at least 200 MB for more efficient index creation. Run as "lean and mean" as you can.
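
        For the record layout described earlier (one '>name' line followed by one sequence line), the body of import_data() might look something like the sketch below, and the lookup the OP needs afterwards is a single indexed SELECT. The file name is a placeholder, and this relies on the $dbh created in the script above:

        sub import_data
        {
            my $add = $dbh->prepare("INSERT INTO dna ( protein, sequence) VALUES(?,?)");

            open my $fh, '<', 'database.fa' or die "Cannot open FASTA: $!";
            while ( my $protein = <$fh> ) {
                my $sequence = <$fh>;
                last unless defined $sequence;
                chomp( $protein, $sequence );
                $protein =~ s/\A>//;             # drop the leading '>'
                $add->execute( $protein, $sequence );
            }
            close $fh;
        }

        # Later, retrieval by name uses the iprotein_idx index:
        my $get = $dbh->prepare("SELECT sequence FROM dna WHERE protein = ?");
        $get->execute('protein1');
        my ($sequence) = $get->fetchrow_array;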

        Hi, it is fairly simple. You store the hash in a file, and then retrieve it when you want to use it (see the example below). It is retrieved as a hashref, but you can make it a regular hash with a '%' prefix.
        use Storable;

        store \%table, 'file';
        $hashref = retrieve('file');
        Good luck
        MrGuy
Re: Efficient way to handle huge number of records?
by flexvault (Monsignor) on Dec 11, 2011 at 14:10 UTC

    The following code will generate the database using BerkeleyDB. I generated the keys; you would read them in from your file.

    #!/usr/local/bin/perl -w our $Debug = 0; our $SortLog = 0; use strict; + use BerkeleyDB; our %Session = (); use Time::HiRes qw( gettimeofday ); use Fcntl qw( :flock ); my $DBHome = "/home/FlexBase/"; if ( substr($DBHome,-1) ne "/" ) { $DBHome .= "/"; } if ( ! -d $DBHome ) { mkdir $DBHome or die "DBHome directory not correct!"; } if ( ! ( ( -r _ )&&( -w _ ) ) ) { die "DBHome directory is not valid!" +; } system("rm /home/FlexBase/*"); ## Temporary for testing our $filename = "/home/FlexBase/TestBDB"; use constant CACHE => '1048576'; my $Cachesize = CACHE; our $env = new BerkeleyDB::Env ( -Home => '/home/FlexBase', -Cachesize => CACHE, -Flags => DB_INIT_MPOOL|DB_INIT_CDB|DB_CREATE ) or die "cannot open ENV: $! $BerkeleyDB::Error\n"; our %Pyr = (); our $BLK = 512; our $db1 = tie %Pyr, 'BerkeleyDB::Btree', ( -Filename => "$filename", -Env => $env, # -Pagesize => 4096, ## Use of this makes the Ber +keleyDB operate worst! -Flags => DB_CREATE ); # create some data our $Chr="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123 +456789|"; our $size = length($Chr)-1; our $Rno = 0; our $LockFile = "../tmp/BDBmail_lock"; system("touch $LockFile"); open(LOCK,"+<",$LockFile) || die "Cannot open $LockFile:$!"; # Associate it with the database my $enable = $env->cds_enabled(); if ( $enable ) { print "cds_enabled\n"; } my $ret = 0; my $cnt = 0; my $Tot = 0; my $howmany = 11_000_000; my $ky = ""; my $data = "a" x 80; my $ap +pkey = "-123456789"; # $howmany = 2_000; ## Testing our ( $WR, $RD ); if ( $SortLog ) { my $logFile = "$filename"."_Write"; open($WR,">",$logFile) || die "Cannot open $logFile:$!"; $logFile = "$filename"."_Read"; open($RD,">",$logFile) || die "Cannot open $logFile:$!"; } our $NAME = "BerkeleyDB"; our $kb = "_KB"; our ($pmem1,$pmem2) = &Display_Mem_Usage($$,$NAME,0); print "## Start: VSZ-$pmem1$kb RSS-$pmem2$kb\tBLOCK: $BLK ( $howma +ny )\n"; my ( $cursor, $lock, $wpm, $key, $count, $k ); my $stime = gettimeofday; while ( $cnt < $howmany ) { # my $st = $cnt % 58; # my $chr = chr( 122-$st ); $key = "A-$cnt"; $cnt++; $Tot += length($key)+length($data); flock( LOCK, LOCK_EX ); $lock = $db1->db_cursor(DB_WRITECURSOR); $ret = $db1->db_put($key, $data); undef $lock; flock( LOCK, LOCK_UN ); if ($ret != 0 ) { die "Write failed on BDB3: $ret\n"; } if ( $SortLog ) { print $WR "$key$data\n"; } } flock( LOCK, LOCK_EX ); $lock = $db1->db_cursor(DB_WRITECURSOR); my $strt = chr(1); $data = ""; $ret = $db1->db_put($strt, $data); my $dbend = chr(255); $data = ""; $ret = $db1->db_put($dbend, $dat +a); undef $lock; flock( LOCK, LOCK_UN ); $db1->db_sync(); undef $cursor; my $etime = gettimeofday - $stime; $wpm = int($howmany/$etime); print "Write: $etime $wpm\/sec $cnt\n"; if ( $SortLog ) { close $WR; system("sort -o /home/FlexBase/TestBDB_Write.sort /home/FlexBa +se/TestBDB_Write"); } $stime = gettimeofday; $count = 0; $key = " "; $data = ""; $ret = 0; while ($ret == 0) { flock( LOCK, LOCK_SH ); $cursor = $db1->db_cursor(); $ret = $cursor->c_get($key, $data, DB_SET); $ret = $cursor->c_get($key, $data, DB_NEXT); flock( LOCK, LOCK_UN ); undef $cursor; if ( ord($key) == 1 ) { next; } if ( ord($key) == 255 ) { last; } $0 = "SLock 3 on $key"; $count++; ## if ( $count > 28 ) { last; } if ( $SortLog ) { print $RD "$key$data\n"; } ## print "$key\t$data\t$count\n"; } $db1->db_sync(); $etime = gettimeofday - $stime; $wpm = int($count/$etime); print "ReadNext: $etime $wpm\/sec Total: $count\n"; if ( $SortLog ) { close $RD; system("/usr/bin/sort -o /home/FlexBase/TestBDB_Read.sort /hom 
+e/FlexBase/TestBDB_Read"); } my ($mem1,$mem2) = &Display_Mem_Usage($$,$NAME,0); my $diff1 = $mem1 - $pmem1; my $diff2 = $mem2 - $pmem2; print "## End: VSZ-$mem1$kb RSS-$mem2$kb Diff:$diff1|$diff2$kb +\tBLOCK: $BLK\n\n"; exit; sub Display_Mem_Usage { # VSZ is size in KBytes of the virtual memory ( VSZ * 1024 ) # RSS is size in pages of real memory ( 1024 * RSS ) my $cpid = shift; my $name = shift; my $from = shift; my $var = ""; my $fh; my $tm = ""; if ( $Debug >= 2 ) { $tm = scalar localtime() . " "; } if ( ! ( kill 0 => $cpid ) ) ## Check that pid is active { return ( -1, -1 ); } my $arg = qq| -o "vsz rssize" -p $cpid|; open ( $fh, "-|", "/bin/ps $arg" ) or die "Prefork: Not open \'$ar +g\': $!"; while (<$fh>) { $var .= $_; } close $fh; my $rno = my @ref = split(/\n/,$var); if ( $rno < 2 ) { return ( -1, -1 ); } my $info = join(" ", split " ", $ref[1]); my ($vmem,$rmem) = ( split(/\ /,$info) ); return ( $vmem , $rmem ); } 1;

    And the output would be.

    # perl Show11M.plx
    cds_enabled
    ## Start: VSZ-10292_KB RSS-4808_KB    BLOCK: 512 ( 11000000 )
    Write: 302.145446062088 36406/sec 11000000
    ReadNext: 340.443499088287 32310/sec Total: 11000000
    ## End:   VSZ-10292_KB RSS-6264_KB Diff:0|1456_KB    BLOCK: 4096

    And the directory where the database resides would contain:

    -rw-r--r-- 1      24576 Dec 11 08:37 /home/FlexBase/__db.001
    -rw-r--r-- 1     245760 Dec 11 08:37 /home/FlexBase/__db.002
    -rw-r--r-- 1    1318912 Dec 11 08:37 /home/FlexBase/__db.003
    -rw-r--r-- 1     548864 Dec 11 08:37 /home/FlexBase/__db.004
    -rw-r--r-- 1 2100948992 Dec 11 08:26 /home/FlexBase/TestBDB
    -rw-r--r-- 1  989888890 Dec 11 08:34 /home/FlexBase/TestBDB_Read
    -rw-r--r-- 1  989888890 Dec 11 08:37 /home/FlexBase/TestBDB_Read.sort
    -rw-r--r-- 1  989888890 Dec 11 08:26 /home/FlexBase/TestBDB_Write
    -rw-r--r-- 1  989888890 Dec 11 08:29 /home/FlexBase/TestBDB_Write.sort

    And to test that all the data was loaded:

    # md5sum /home/FlexBase/TestBDB_Read /home/FlexBase/TestBDB_Write
    cd6ae6c532bc918f4ccdda9cc1d3f671  /home/FlexBase/TestBDB_Read.sort
    cd6ae6c532bc918f4ccdda9cc1d3f671  /home/FlexBase/TestBDB_Write.sort

    'diff' runs out of memory with files this large, but 'md5sum' has no problem. The extra step to verify that the database is loaded is optional, but it is nice to know you have all of the data. I test different databases all the time, so there is a lot of extra code that you can delete. I did more deleting than adding to get what you needed.

    Good Luck

    "Well done is better than well said." - Benjamin Franklin

Re: Efficient way to handle huge number of records?
by Sewi (Friar) on Dec 11, 2011 at 14:15 UTC
    If a server-based database is no option and SQLite also isn't what you want, a tie()'d hash would be an alternative option for fast data access. It isn't as fast as a memory-stored hash, but the speed is still OK.
    It might be faster or slower than a SQL-based database depending on what you're doing...
Re: Efficient way to handle huge number of records?
by TJPride (Pilgrim) on Dec 11, 2011 at 13:33 UTC
    As stated above, yes, Perl can handle a hash of this size, assuming sufficient memory. However, this is probably not ideal on a lot of computer systems. I would highly recommend either a MySQL database or a restructuring of the data into a fixed-length, sorted format that you can then do an ordered lookup on whenever you need to locate a record.

    The latter method does require loading all the data into memory and sorting it once, but you could theoretically do that on someone else's computer if necessary, since it's a one-time thing. The sorted data, once written to disk, would take virtually no overhead or lookup time to locate a specific record (specifically, log2 n comparisons, where n is the total number of records).

    The MySQL database would have the advantage of more flexibility, and you don't necessarily have to feed MySQL the records one at a time - you could just convert the data line by line to INSERT commands, save them as .sql and do an import all at once. There is probably a way to do it directly from CSV, delimited or fixed-length data as well. As long as the database is on the same computer as the data file, you should get a pretty fast import.
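
    A sketch of that fixed-length, sorted-file lookup (the record and name widths are arbitrary choices here, and the file must already be sorted by the padded name):

    use strict;
    use warnings;

    use constant RECLEN  => 4096;    # one space-padded record: name + sequence
    use constant NAMELEN => 60;

    sub lookup {
        my ( $fh, $nrecs, $want ) = @_;
        my ( $lo, $hi ) = ( 0, $nrecs - 1 );
        while ( $lo <= $hi ) {
            my $mid = int( ( $lo + $hi ) / 2 );
            seek $fh, $mid * RECLEN, 0 or die "seek: $!";
            read $fh, my $rec, RECLEN;
            ( my $name = substr( $rec, 0, NAMELEN ) ) =~ s/\s+\z//;
            if    ( $want lt $name ) { $hi = $mid - 1 }
            elsif ( $want gt $name ) { $lo = $mid + 1 }
            else  { return substr( $rec, NAMELEN ) }     # the sequence part
        }
        return;                                          # not found
    }

    # Usage: open the sorted file and work out the record count from its size.
    open my $fh, '<', 'sorted.dat' or die "Cannot open: $!";
    my $nrecs = ( -s $fh ) / RECLEN;
    my $seq   = lookup( $fh, $nrecs, 'NAME1' );
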
Re: Efficient way to handle huge number of records?
by sundialsvc4 (Abbot) on Dec 12, 2011 at 13:26 UTC

    If you’ve got the memory, and it is uncontested memory (i.e. swapping will not occur), then your problem is indeed an easy one:   “throw silicon at it,” and be done.   Nothing except the CPU itself is faster than silicon.   If you don’t have enough memory (but, you do have a 64-bit machine), well, how much do those chips cost?

    Obviously, there is time spent loading all that data into memory, which may or may not be desirable.   (It really depends on how much of the data you expect to process at any one time.   It would be a one-time cost per run.)

    Another possibility is to use (say...) an SQLite database file purely as an index to the file, storing the starting-position of the interesting bits as integer offsets from the beginning of the file.   Use the index to find what you’re looking for.   Also consider sorting this list of offsets (in memory, of course) into ascending order so that the hardware can zip right through the file sequentially from front to back, seeking to always-forward positions as needed.   May not make a difference, but it might.

    It is easy to get too-clever and to spend a lot of time and effort implementing things that really do not matter in the end; that sound cool in theory but that really do not speed things up vis à vis the time spent writing and debugging it.   A simple stopwatch comes in handy.   It may well be that you keep a hash of what you’re looking for and the program reads the entire file each time (or until it has found everything), and, even though the run time is “larger than it might be,” it is consistent.

      P.S. SQLite is very efficient and very fast .. but pay close attention to "transactions" when using it. If a transaction is under way it's very clever about lazy-writing; but if not it verifies every single disk write. (As it is designed to do.) Makes a huge difference in speed when writing. Nice thing about SQLite (aside from the fact that it seems to be pretty much bulletproof) is that "a database = a file, nothing more or less." Maybe just the ticket for building (any number of) handy indexes into your file.
