
Re^4: Compress positive integers

by tachyon-II (Chaplain)
on Apr 09, 2008 at 00:35 UTC


in reply to Re^3: Compress positive integers
in thread Compress positive integers

You don't seem to be listening. What you are trying to do will never work at scale. You need to do all the heavy lifting on the RDBMS side. The heart of the problem is this:

column Posting List ==> "10 20;3 21 2; 43 100;105;......

The core problem with this format is that you MUST pull it back into perl to manipulate it. The more documents you have and the more words per document, the more data you need to pull to process your query. This will never scale. Not today, not tomorrow, not ever. Compression will only paper over the cracks, if it helps at all, because of the overhead it adds.

The table format suggested by BrowserUk is not optimal, but it is along the lines of what you want, as it indicates the need for 3 basic tables. For best speed you want to avoid BLOB and VARCHAR, i.e. anything that is not fixed width, as this makes indexing much harder, and it is the indexing that is going to provide the speed/scalability, in the same way a hash gives you fast "random" access to a large array.

The basic unit of data you want to store consists of

WORD IN_DOCUMENT AT_POSITION

To store this efficiently you will want to keep this unit of data in a compressed form. The compressed form you really want is one where WORD becomes a WORD_ID and IN_DOCUMENT becomes a DOC_ID. If you think about a search, you input words to search for, so the words table needs to rapidly convert a WORD to a WORD_ID. Using that WORD_ID we look in our basic unit data table and get back some DOC_IDs, which we then need to convert back to IN_DOCUMENT. Thus the tables you need are:

TABLE words
    word      CHAR(32) PRIMARY KEY
    word_id   INT

TABLE docs
    doc_id    INT PRIMARY KEY
    doc_name  VARCHAR

TABLE word_doc
    word_id   INT
    doc_id    INT
    pos       INT

To complete this we must create an index on the word_doc table. As it stands, this table is just a big array; when we create an index on it, it effectively becomes a big hash that gives us rapid random access:

CREATE INDEX warpspeed ON word_doc (word_id)
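
For what it's worth, here is a rough DBI sketch of creating that structure from Perl. The DSN, credentials and the VARCHAR(255) size are placeholders only; adjust the types for whatever RDBMS you are actually using:

use DBI;

# sketch only - swap the DSN/user/password for your own database
my $dbh = DBI->connect( 'dbi:ODBC:mydb', 'user', 'pass', { RaiseError => 1 } );

$dbh->do('CREATE TABLE words    ( word CHAR(32) PRIMARY KEY, word_id INT )');
$dbh->do('CREATE TABLE docs     ( doc_id INT PRIMARY KEY, doc_name VARCHAR(255) )');
$dbh->do('CREATE TABLE word_doc ( word_id INT, doc_id INT, pos INT )');
$dbh->do('CREATE INDEX warpspeed ON word_doc (word_id)');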

With this structure in place you can then make a query into the word_doc table that will return the data you want. The basic query will be:

SELECT doc_id FROM word_doc WHERE word_id = ?
INTERSECT
SELECT doc_id FROM word_doc WHERE word_id = ?

This will work but will still be slow. The problem is that it is a big task and we can't use the index to go straight to the data we want. The solution to this is to predict and cache.
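
For example, a rough sketch of driving that query from DBI, converting words to WORD_IDs on the way in and DOC_IDs back to document names on the way out (it assumes the $dbh and tables sketched above; older MySQL has no INTERSECT, so there you would rewrite the middle query as a self-join on word_doc):

# sketch only - assumes the $dbh and schema from the sketch above
sub docs_containing_both {
    my ( $dbh, $word1, $word2 ) = @_;

    # WORD -> WORD_ID via the words table
    my @ids;
    for my $word ( $word1, $word2 ) {
        my ($id) = $dbh->selectrow_array(
            'SELECT word_id FROM words WHERE word = ?', undef, $word );
        return unless defined $id;    # unknown word, nothing can match
        push @ids, $id;
    }

    # the intersection is done on the RDBMS side, not in perl
    my $doc_ids = $dbh->selectcol_arrayref( <<'SQL', undef, @ids );
SELECT doc_id FROM word_doc WHERE word_id = ?
INTERSECT
SELECT doc_id FROM word_doc WHERE word_id = ?
SQL

    # DOC_ID -> document name via the docs table
    return map {
        $dbh->selectrow_array(
            'SELECT doc_name FROM docs WHERE doc_id = ?', undef, $_ );
    } @$doc_ids;
}

Even then, as noted above, two common words mean two big index range scans, which is exactly why the predict-and-cache step is where the real speed comes from.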

Replies are listed 'Best First'.
Re^5: Compress positive integers
by MimisIVI (Acolyte) on Apr 09, 2008 at 13:19 UTC
    Hi tachyon-II,

    First of all I have a question for you...
    Have you ever tried to do the things that you describe here in practice?

    Have you ever tried to do the intersection that you propose on a
    table with 2,000,000,000 records?

    If you did, and the performance was satisfactory, I can't say anything... but before I ended up with this schema I spent
    4 months trying all the schemas you describe with MySQL, and the results weren't satisfactory (I read all the
    optimization tips about MySQL queries, and believe me, the performance wasn't good).
    So I can say for sure that the schema with the posting lists is the best for large document
    collections... (by the way, Lucene has more or less the same index schema too...)

    Now, about the tables that you propose: the first is the very famous LEXICON, and of course I use it in my application...
    The same goes for the second one, where I save the path or the address of each document that I am indexing...


    Now, one thing that I want to test is using the file system
    to create my index instead of the DBMS (SQL Server 2005),
    but I don't know whether it will be efficient to use binmode files to save all the required info that I need..


    One last question from me is how I can read from a bit string one bit at a time...
    I fetch from the DBMS a binary compressed string with all the doc IDs for one term..
    To decode the string I use the code below, which is extremely fast...

    my $decode = unpack "b*", $compressed;

    But I need to read from the string one bit at a time, and the code below is very slow...

    my @bits = split(//, unpack("b*", $compressed));

    so I want to ask whether there is any way to read from the compressed or the decoded string one bit at a time...



    Thanks for your time..


    Mimis

      Well, basically the answer is yes, I have done this, and so for that matter has BrowserUk. Not this exact task, but a very similar one.

      Let me clarify. In 2002-2003 BrowserUk and I worked on a project that pulled down the entire content of the then-existent internet (which looks to be about the same size as your dataset :-), tokenised it into 1, 2 and 3 "word/phrase" tokens, applied Bayesian algorithms to the data (which involved finding the intersection of multi-gigabyte data sets), ran all the data munging algorithms in perl using flat files, sorted and merged the results, and then dumped the results into an RDBMS at the end. We managed to reduce the initialisation runtime from greater than the known life of the universe to under 24 hours. This was done under the limitations of quad Xeon processors, 4 GB of RAM and 160 GB of disk, as I remember it.

      In our system the final output was essentially a single huge RDBMS table. Our index (although bigger than the table) still ended up being small enough to fit into 4 GB of RAM and the systems that ran in production could manage 300+ queries per second in real time running perl, mysql, squid and apache. It ran fast because the final query required very little work, and the result could be found straight from the index.

      It's hard to explain to you how much optimisation was required. To get Linux to transfer data at 50 MB/sec you have to have fast disks, but you also need to give it the freedom to do its job. You need to specifically give your RDBMS the ability to hold its indexes in memory, and you must have the right indexes. You need to optimise your schema, you need to Benchmark, and you must find and remove the bottlenecks. There is no point in making fast code faster; you need to make slow code faster first (the bottleneck) and then make fast code faster. Have a look at this for a start. On a typical system you can improve the performance of MySQL by AN ORDER OF MAGNITUDE (10x) by modifying the my.cnf file - without changing anything else.
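
      To give a hedged example of the kind of my.cnf change meant here (MyISAM-era settings; the values are purely illustrative, size them to your own RAM):

      [mysqld]
      # let MySQL hold its indexes in memory (MyISAM key cache)
      key_buffer_size  = 1G
      # cache repeated identical SELECTs
      query_cache_size = 64M
      sort_buffer_size = 4M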

      Every time I thought I needed to use C to make it faster, either BrowserUk or I (usually him) came up with a solution (a better algorithm) that let us do it in Perl.

      You keep saying "I want to do it this way". Well, here is the rub. You don't seem to do C/C++. You don't seem to understand bit string operations. You don't seem to understand how the overhead of transferring data to perl when you don't need to is going to kill you. Most importantly, you don't seem to understand that no amount of optimisation will make a slow algorithm fast.

      Yes, you can "pre-calculate" stuff using raw data and do it outside a DB. And yes, it can make the final queries fast. We did. Google do. For example, they don't try to query their index for proximity of "George Bush" as two words every time someone types it in. They do it once, cache the result and then run off the cache, updating periodically. Caching is everywhere. The disks do it, the OS does it, the RDBMS does it, and all these caches can be optimised to the task at hand.

      For example, if you do what we did and tokenise every page you can find on the internet, you will end up with around 100,000 "tokens". Not that many, but if you go to two-token pairs the worst case is 100,000^2, or 10 billion. That is of course a little less manageable; however, as it turns out, the number of actual pairs is considerably less, and if you ignore rare tokens, i.e. pairs that occur less than a few times, the number falls into the millions. With 3-word tokens you have to be a bit more brutal, but you still end up with "only" around 100 million tokens (3-word triplets). Using the data storage format I suggested you need 4 bytes x 3 to store each row, which gives you a table size of "only" 1.2 GB.

      FWIW, to do what you want you just need vec or the bit-shift operators << and >>. They won't help with the underlying scaling problem, but HTH anyway.
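
      A sketch of what I mean, using your variable name: vec reads the packed string in the same ascending bit order that unpack "b*" gives you, so you can pull out bit $i directly instead of splitting the whole string into a huge array first:

      # $compressed is the packed string you fetched from the DBMS
      my $nbits = 8 * length $compressed;
      for my $i ( 0 .. $nbits - 1 ) {
          my $bit = vec( $compressed, $i, 1 );  # same value as substr(unpack("b*", $compressed), $i, 1)
          # ... feed $bit into the Elias gamma decoder here ...
      }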

        Hi tachyon-II one more time,


        First, I want to thank you for taking the time to explain these things, which are very interesting to me...

        I am not afraid to say that you are right in what you said about me, but because IR is a field that I love and want to work in for my whole life, I have the courage to learn everything I need to become an expert at it.. So comments like yours make me try harder and harder.. thanks for that..

        About the indexes that I tried with MySQL, I have to say that they were completely on disk, even the lexicon (I didn't even use the MySQL query cache).. Maybe that was stupid, but I am trying to find a schema which will be fast and small and can work on any PC without system optimizations.. maybe that is stupid too, but it's something I like to research, so I don't mind spending the time on it..


        About your comment on my algorithm, do you mean my index schema or the Elias gamma code? I suppose both.. :) I also want to ask whether you tried this schema in your project (with the long posting lists..)? I suppose yes...

        One thing that I would like to hear from you guys is your opinion of the Lucene index schema.. I am trying to figure out what is going on with that index tool, but my knowledge of the low-level code in Java and Perl (for the Perl ports like KinoSearch) is zero..

        Thanks again for your patience..

        Cheers..
