http://qs321.pair.com?node_id=678848

MimisIVI has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have built a search engine which uses an inverted index.
For a small number of documents it is quite fast, but not for
large collections ...


The solution is to compress the data that I am saving in
the index, but I have trouble understanding the two most
helpful Perl functions for this (pack and unpack)...


There are a lot of compression techniques, like Elias delta
code, Golomb code, etc.


I searched the net for Perl code but without any luck.
I found this code in C++, which simply encodes and
decodes positive integers using the byte-aligned compression technique..

Encode Integers:

    int outPos = 0, previous = 0;
    for (int inPos = 0; inPos < n; inPos++) {
        int delta = uncompressed[inPos] - previous;
        previous = uncompressed[inPos];          /* remember the value for the next delta */
        while (delta >= 128) {
            compressed[outPos++] = (delta & 127) | 128;
            delta = delta >> 7;
        }
        compressed[outPos++] = delta;
    }

Decode Integers:

    int inPos = 0, previous = 0;
    for (int outPos = 0; outPos < n; outPos++) {
        for (int shift = 0; ; shift += 7) {
            int temp = compressed[inPos++];
            previous += ((temp & 127) << shift);
            if (temp < 128) break;
        }
        uncompressed[outPos] = previous;
    }

In each case, n postings are processed. Uncompressed
postings are stored in the integer array uncompressed,
compressed postings in the byte array compressed.
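
If I understand the technique correctly, a worked example: to store the value 300 (with previous = 0), delta = 300; since 300 >= 128 the first byte is (300 & 127) | 128 = 172, and delta becomes 300 >> 7 = 2; since 2 < 128 the second byte is simply 2. Decoding reads 172, keeps its low 7 bits (44), sees the high bit set so it reads the next byte, 2, shifts it left by 7 (256) and adds it back: 44 + 256 = 300. So small deltas cost one byte and larger ones cost more.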


Unfortunately my knowledge of C++ is very limited, and I
need your help to translate this code into Perl!!!

If anyone knows any other compression techniques for positive integers (except Huffman,
because it is too complex for me..), that would be a huge help for me too...


Thanks in Advance!
Mimis

Replies are listed 'Best First'.
Re: Compress positive integers
by ikegami (Patriarch) on Apr 07, 2008 at 23:08 UTC

    For starters, you'd get better compression using one of the numerous compression modules on CPAN. (Some sample queries: Compress, gzip, zip)

    The technique used by the code you posted is useful for saving space on disk when dealing with lots of small numbers. It takes fewer bytes to save small numbers, and an extra byte to save very large numbers. Its advantage over general-purpose compression algorithms is that you can seek to known and unknown positions in the file.

    1 byte:  0 .. 127
    2 bytes: 128 .. 16_383
    3 bytes: 16_384 .. 2_097_151
    4 bytes: 2_097_152 .. 268_435_455
    5 bytes: 268_435_456 .. 34_359_738_367
    6 bytes: 34_359_738_368 .. 4_398_046_511_103
    7 bytes: 4_398_046_511_104 .. 562_949_953_421_311
    8 bytes: 562_949_953_421_312 .. 72_057_594_037_927_935
    9 bytes: from 72_057_594_037_927_936

    Depending on the size of the ints on your system and the precision of floats on your system, some of the above aren't available or more than the above is available.

    Untested:

    sub compress_uint {
        my ($uint) = @_;
        my $compressed = '';
        while ($uint >= 128) {
            $compressed .= chr(($uint & 127) | 128);
            $uint >>= 7;
        }
        $compressed .= chr($uint);
        return $compressed;
    }

    for (100, 1000, 100000) {
        print $fh compress_uint($_);
    }
    sub decompress_uint {
        our $compressed;
        local *compressed = \($_[0]);
        my $bytes_read = 0;
        my $uint  = 0;
        my $shift = 0;
        for (;;) {
            die if length($compressed) == $bytes_read;
            my $byte = ord(substr($compressed, $bytes_read++, 1));
            $uint |= ($byte & 127) << $shift;   # low-order 7 bits arrive first
            $shift += 7;
            last unless $byte & 128;            # a clear high bit marks the final byte
        }
        $compressed = substr($compressed, $bytes_read);
        return $uint;
    }

    my $raw = do { local $/; <$fh> };
    my @nums;
    while (length($raw)) {
        push @nums, decompress_uint($raw);
    }
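
    A quick round-trip check (also untested), using an in-memory string instead of a file handle:

    my $buf = '';
    $buf .= compress_uint($_) for 100, 1000, 100000;

    my @nums;
    push @nums, decompress_uint($buf) while length $buf;
    print "@nums";    # 100 1000 100000

    Note that Perl's built-in pack 'w' template (a BER compressed integer) is a similar variable-length base-128 encoding, so

    my $ber  = pack 'w*', 100, 1000, 100000;
    my @back = unpack 'w*', $ber;

    gets you much the same effect without rolling your own (the byte order within a number differs, but the sizes are comparable).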

    Update: Fixed some typos in text and code.

Re: Compress positive integers
by BrowserUk (Patriarch) on Apr 08, 2008 at 02:02 UTC

    Had you posted a link to the pdf describing the algorithm and its performance advantages, you might not be copping the flak and "you don't wanna be doing that"s that you're getting.

    With regard to the algorithm, you will not realise those performance benefits if you code this in Perl.

    With regard to your problem:

    • How big are your indexes?

      Ie. How many offsets do they contain? (How many documents are you indexing?)

    • How big are your offsets?

      IOWs, how big is your largest document?

    • How are you writing your indexes?

      If you're not writing them in binary (ie. packing them), then you are wasting large amounts of space and time.

    Answer those questions and we may be able to help you.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Compress positive integers
by FunkyMonk (Chancellor) on Apr 07, 2008 at 23:09 UTC
    because it is too complex for me
    Then let someone else worry about the complexity

    Search CPAN for compress and see if anything useful turns up

Re: Compress positive integers
by tachyon-II (Chaplain) on Apr 08, 2008 at 01:13 UTC

    For a small number of documents it is quite fast, but not for large collections ...

    Which means that your code almost certainly runs in some form of exponential time. I don't see how compression will help you: in the short term the overhead of compression/decompression will hurt you, and in the long term I can't see how your fundamental algorithmic problem is addressed by compression. An algorithm is still O(n^2) with or without compression.

    In broad terms compression will only speed up an application if it lets you pass data faster through a narrow bandwidth chokepoint where the time price of compression/decompression has a worthwhile payoff.

    If you have already observed that your code grinds to a halt on large data sets, you need to fix the fundamental cause of that. If you explained your problem more clearly, it is possible that you may get a useful solution. I am almost certain that compression is not the solution you need.

    Cheers

    tachyon

Re: Compress positive integers
by Cody Pendant (Prior) on Apr 07, 2008 at 23:47 UTC
    Not to get all Y-X on you, but the solution is to use a database, surely?


    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...

      Not to get all pedantic on you, but a solution is to use a database. There are other solutions.

      <radiant.matrix>
      Ramblings and references
      The Code that can be seen is not the true Code
      I haven't found a problem yet that can't be solved by a well-placed trebuchet
        If you read below... you will see that I am using a database.... :)
Re: Compress positive integers
by MimisIVI (Acolyte) on Apr 08, 2008 at 02:41 UTC
    Hi guys,

    First of all thanks for your answers...


    My mistake was that I didn't explain what an inverted index is...

    For each term that occurs in the document collection
    that I index, I save a string with some required info in a DBMS.
    This info is the IDs of the docs in which the term occurs and the positions where the term occurs in each document...



    Inverted Index schema for the word with ID eq 1:
    (this word appears only in two pages)

    WORD ID==>POSTING LIST
    1 ====> "Doc1 Space Pos1;Pos2;..PosN; Space Doc2 Space Pos1;Pos2;..PosN;"


    As I told you before, the problem is when I index a big
    document collection (e.g. the whole Wikipedia). In that case
    the strings (posting lists) where I save the information for each term are very large
    and fetching them from disk is very expensive.
    That's why I told you the solution is to compress the posting lists efficiently..

    The biggest problem is that I am not using the pack function, so I am working with scalar
    values and not BINARY ...
    The document collection is not something standard, and neither is the size
    of each document, so I can't answer those questions..

    A very common compression technique is the Elias gamma code (http://en.wikipedia.org/wiki/Elias_gamma_coding).
    The Wiki page has code in Java, and I translated it into Perl like below...

    sub GammaCode {
        my $n = $_[0];
        my $gammaCode;
        my $mystring;
        my $log = int(log2($n));
        return 1 if $n == 1;
        for (1 .. $log) { $gammaCode .= "1"; }
        $gammaCode .= "0";
        $mystring = dec2bin($n);
        $mystring =~ s/^\w{1}//;     # drop the leading 1 bit
        $gammaCode .= $mystring;
        return $gammaCode;
    }

    sub log2 {
        my $n = shift;
        return log($n) / log(2);
    }

    sub dec2bin {
        my $str = unpack("B32", pack("N", shift));
        $str =~ s/^0+(?=\d)//;       # otherwise you'll get leading zeros
        return $str;
    }

    sub bin2dec {
        return unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
    }

    my $number = 11;
    print "The Gamma representation of $number is\t", GammaCode($number), "\n";

    The above sub works correctly for the compression, but the problem is that I
    don't know how to write the return value of the sub as binary ... Furthermore, I
    don't know how to decode this algorithm..
    If you have any experience with the Information Extraction field
    and very good knowledge of the pack/unpack functions, please help...

    Thanks for your interest!!
    Mimis

    P.S. BrowserUk, you are right, I should have cited the source of
    the byte-aligned algorithm... thanks for doing that for me.
    It seems that you are familiar with this field,
    and I hope you can help me...

      You can save about 60% of the space by storing this data in binary. The following assumes that each of your documents can be up to 4GB (hence the 'N' pack format). (This is very tersely coded, just for a quick test):

      #! perl -slw
      use strict;

      use List::Util qw[ sum ];
      use Math::Random::MT qw[ rand ];
      use Data::Dump qw[ pp ];
      $Data::Dump::MAX_WIDTH = 1000;

      use constant NO_OF_DOCS => 10000;
      use constant DOC_PER_WORD => NO_OF_DOCS / 2;
      use constant OCCURANCES_PER_DOC => 20;

      ## Build some test data
      my @words = qw[ the quick brown fox jumps over a lazy dog twice ];
      my %index = map{ $_ => {} } @words;

      for my $word ( keys %index ) {
          for my $doc ( map{ int rand NO_OF_DOCS } 1 .. DOC_PER_WORD ) {
              my $docName = sprintf "Doc%05d", $doc;
              $index{ $word }{ $docName } = [
                  map{ int rand 2**32 } 1 .. OCCURANCES_PER_DOC
              ];
          }
      }
      ##pp \%index;

      ## Building the posting lists in ASCII
      my @postings = map {
          my $wordHash = $_;
          join ' ', map{
              "$_ " . join ':', @{ $wordHash->{ $_ } };
          } keys %{ $wordHash };
      } values %index;
      ## pp \@postings;

      printf "ASCII uses %d bytes\n", sum map length, @postings;

      ## Doing them in binary
      my @binPostings = map {
          my $wordHash = $_;
          pack '(N/A*)*', map {
              pack 'C/A* A*', $_, pack 'N*', @{ $wordHash->{ $_ } };
          } keys %{ $wordHash };
      } values %index;

      printf "Binary uses %d bytes\n", sum map length, @binPostings;
      #pp \@binPostings;
      <>;

      ## Unpack the packed binary
      my $wordIndex = 0;
      for my $wordData ( @binPostings ) {
          print $words[ $wordIndex++ ];
          for my $docData ( unpack '(N/A*)*', $wordData ) {
              my( $doc, $posns ) = unpack 'C/A* A*', $docData;
              print "\t$doc: ", join ', ', unpack 'N*', $posns;
          }
      }

      __END__
      ## Output for 10 words appearing 20 times in each of
      ## half of 10,000 4GB documents at random positions.

      C:\test>678848
      ASCII uses 8,801,314 bytes
      Binary uses 3,656,853 bytes
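
      In case the pack templates above look cryptic: the '/' construct packs a length in front of a string, so 'N/A*' is a 4-byte big-endian length followed by that many bytes, and 'C/A*' is the same with a 1-byte length. A minimal illustration (not part of the test above):

      my $packed = pack 'N/A*', 'hello';     # "\0\0\0\5" . 'hello'
      print length $packed;                  # 9 (4 + 5)
      my( $str ) = unpack 'N/A*', $packed;
      print $str;                            # hello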

      But don't stop reading yet.

      If you are storing these records in an RDBMS, you are wasting the potential of that tool. Doing it this way (in either ASCII or binary), to find all the documents that contain 2 words, you are having to retrieve and unpack all the positions for both words in every document in which they appear, and then doing the union of the document sets in Perl.

      RDBMSs exist to do set algebra. Doing it in Perl, and having to retrieve and unpack all that data to do it, is silly. You'd be far better off changing your schema to allow the RDBMS to give you only those offsets for those documents that contain the word or words you require. But you'll need a DBA, or at least someone whose SQL skills are a lot more current than my own, to set you straight on that.

      A schema something like:

      TABLE words
          word-id  MEDIUMINT  primary key
          word     VARCHAR

      TABLE docs
          doc-id   INTEGER    primary key
          docname  VARCHAR

      TABLE word-doc
          word-id  MEDIUMINT  REFERENCES words( word-id ) } primary key
          doc-id   INTEGER    REFERENCES docs( doc-id )   }
          posns    MEDIUMBLOB

      That should be far more compact, and it allows you to issue queries that find (for example) all the documents that contain two (or more) given words, and only return the offsets for those words in those documents. Far more of the work is done by the RDBMS, and far less data needs to be read, transmitted and unpacked.
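
      For example (untested, and assuming tables shaped like the sketch above but with underscores rather than hyphens in the names: words, docs, word_doc with word_id, doc_id, docname, posns), a query along these lines lets the database do the intersection for you:

      use DBI;

      my ( $user, $pass ) = ( 'user', 'password' );    # placeholders
      my $dbh = DBI->connect( 'dbi:mysql:myindex', $user, $pass, { RaiseError => 1 } );

      my $sql = q{
          SELECT d.docname, wd1.posns, wd2.posns
          FROM   word_doc wd1
          JOIN   word_doc wd2 ON wd2.doc_id = wd1.doc_id
          JOIN   docs     d   ON d.doc_id   = wd1.doc_id
          WHERE  wd1.word_id = ( SELECT word_id FROM words WHERE word = ? )
            AND  wd2.word_id = ( SELECT word_id FROM words WHERE word = ? )
      };

      ## Each row: [ docname, packed positions for word 1, packed positions for word 2 ]
      my $rows = $dbh->selectall_arrayref( $sql, undef, 'gratis', 'twee' );

      Only the documents containing both words come back, along with just the position blobs you then unpack in Perl.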

      You could then go a step further and investigate bitstring fields and boolean algebra.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Hi again,

        I am afraid that you didn't understand the schema that I use...

        One more time...

        Document collection: 5000 documents
        Average size of each document (number of words): 554 words
        Number of individual words that appear in the collection: 15000


        Now for each word that appears in the collection (in this case 15000 words)
        I save a posting list in the DBMS, 15000 posting lists in total... like below...


        word="Gratis" --> word id=15
        column word Id ==> 15
        column Posting List ==> "1 2;3;4; 2 5;2; 3 1;15;......

        word="Twee" --> word id=10
        column word Id ==> 10
        column Posting List ==> "10 20;3 21 2; 43 100;105;......

        And the same for each of the 15000 words....

        So if the user gives the query -> "gratis twee"

        I will fetch the two posting lists above and merge them
        to find which documents have both terms, and I will rank higher the docs which
        have the terms near to each other, by checking their positions (proximity score).
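
        Roughly, the merge step I mean is something like this (a simplified sketch with made-up data, not my real code):

        # %list1 and %list2 map doc id => [ positions ] for the two query terms,
        # parsed from their posting list strings
        my %list1 = ( 1 => [ 2, 3, 4 ], 2 => [ 5, 2 ], 3 => [ 1, 15 ] );
        my %list2 = ( 3 => [ 21, 2 ], 43 => [ 100, 105 ] );

        my %score;
        for my $doc ( grep { exists $list2{$_} } keys %list1 ) {    # docs with both terms
            my $best = ~0;                                          # smallest distance seen so far
            for my $p1 ( @{ $list1{$doc} } ) {
                for my $p2 ( @{ $list2{$doc} } ) {
                    my $d = abs( $p1 - $p2 );
                    $best = $d if $d < $best;
                }
            }
            $score{$doc} = 1 / ( 1 + $best );                       # closer terms => higher score
        }
        my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;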


        I am afraid that your solution isn't what I am actually looking for,
        ..but anyway thanks for your time..


        As I wrote in a previous post, I found a way to encode
        positive integers, no matter how big they are, by
        using the Elias gamma code...

        An example of the code is like below...

        Number (ASCII) --> Gamma representation (bits)
        1 = 1
        2 = 010
        3 = 011
        4 = 00100
        etc...


        My first question is how I can translate the below string into a bit string...

        posting list ==> "1110001110101011111101101111011"

        And my last question: when I fetch this string, how will
        I unpack it and read one BIT at a time..?

        The decode process is like this:

        1. Read and count 1s from the stream until you reach the first 0. Call this count of ones N.
        2. Read the next N bits of the stream and translate them to a number
        (e.g. 101 = 5).
        3. To decode the number, add 2 to the power of N to the number from step 2.


        So the first number is 9 = 1110001.
        The second is 6 = 11010... and so on.
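
        To make the two questions concrete, this rough sketch is what I am trying to do (surely not the right or fast way):

        my $bits = "1110001110101011111101101111011";         # concatenated gamma codes

        # question 1: write the 0/1 characters as real binary, 8 bits per byte
        my $binary = pack 'B*', $bits;                          # this is what I would store
        # (pack pads to a whole byte, so the number of valid bits must be kept somewhere too)

        # question 2: read it back and walk it one bit at a time
        my $stream = unpack 'B*', $binary;                      # back to a string of 0/1 chars
        my $pos = 0;
        my @numbers;
        while ( $pos < length $bits ) {                         # stop before the pad bits
            my $n = 0;
            $n++ while substr( $stream, $pos++, 1 ) eq '1';     # step 1: count 1s up to the first 0
            my $rest = substr( $stream, $pos, $n );             # step 2: the next N bits
            $pos += $n;
            my $val = length $rest ? oct( "0b$rest" ) : 0;
            push @numbers, 2 ** $n + $val;                      # step 3: 2**N plus the value of those bits
        }
        print "@numbers";                                        # starts 9 6 ... as in the example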

        I hope you can help me ....

        Regards
        Mimis