comment on

Hi guys,

First of all thanks for your answers...

My mistake was that i didnt explain you what is one inverted index...

For each term that occurs in the document collection
which i index, i save in a DBMS a string with some required info
This info is the Doc ids wich the term occurs and the positions where the term occurs in each document...

Inverted Index schema for the word with ID eq 1:
(this word appears only in two pages)

WORD ID==>POSTING LIST
1 ====> "Doc1 Space Pos1;Pos2;..PosN; Space Doc2 Space Pos1;Pos2;..PosN;"

As i told you before the problem is when i index a big
document collection (f.e. The whole WIKIpedia). In this case
the strings(posting List) where i save the information for each term is very large
and the fetching from the disk is very expensive.
Thats why i told you the solution is to compress efficient the posting lists..

The biggest problem is that i am not use the pack function so i am working with scalar
values and not BINARY ...
The document collection is not something standar neither the size
of each document so i cant answer to these questions..

A very common compression technique is the Elias Gamma code (http://en.wikipedia.org/wiki/Elias_gamma_coding)
In the Wiki page there is code in JAVA and i translate it in Perl like below...


sub GammaCode
{
    my $n = $_[0];
    my $gammaCode;
    my $mystring;
    my $log = int(log2($n));
    return 1 if $n==1;
    for(1 .. $log)
    {
        $gammaCode .="1";
    }
    $gammaCode .="0";
    $mystring = dec2bin($n);
    $mystring =~ s/^\w{1}//; 

    $gammaCode .=$mystring;
    return $gammaCode;
}
sub log2 {
    my $n = shift;
    return log($n)/log(2);
    }



sub dec2bin {
    my $str = unpack("B32", pack("N", shift));
    $str =~ s/^0+(?=\d)//;   # otherwise you'll get leading zeros
    return $str;
}

sub bin2dec {
    return unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
}
my $number=11;

print "The Gamma reprasentation of $number is\t",GammaCode($number),"\
+n";
[download]

The above sub is working right for the compression but the problem is that i
dont know how to write the return value of the sub as binary ... Furthermore i
dont know how to decode this algorithm..
If you have any touch with the Information Extraction field
and very good knowledge of the pack/unpack functions plz help...

Thanks for your interest!!
Mimis

P.S BrowserUk you have right, i should refer the source of
the byte-alligned algorithm...thanks that you did for me
It seems that you are familiar with this field,
i hope you can help me...

In reply to Re: Compress positive integers by MimisIVI
in thread Compress positive integers by MimisIVI

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


Perl-Sensitive Sunglasses
	PerlMonks