
Well, I experimented a little further with CMPH. It might be rough around the edges (e.g. temp files aren't cleaned up or created safely), but the good thing is that it basically works. First, I generated some data as follows; that took a while.

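#! /bin/sh
# Read 36-byte records from /dev/urandom: 8 bytes become the value, the
# letters among the remaining 28 bytes become the symbol (retry if none).
# Then drop duplicate symbols, drop duplicate values, restore the original
# order and strip the leading sequence number.
perl -E '$/ = \36; for (1 .. 200e6) { ($v,$s) = unpack q<QA*>,<>; say qq($_ @{[$s =~ y/a-zA-Z//cdr || redo]} $v) }' /dev/urandom |
sort -k2 | perl -ape '$_ x= $F ne $F[1], $F = $F[1]' |
sort -k3 | perl -ape '$_ x= $F ne $F[2], $F = $F[2]' |
sort -n  | perl -pe 's/\S*\s*//'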

If many longer keys are present, it might be better to go with (4-byte) offset records: simple and compact, but with one more indirection, hence slower access.
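To illustrate what I mean by offset records, here is a minimal Perl sketch (not the code used in the test). It assumes all keys are concatenated into one blob, with a packed table of 4-byte offsets alongside; the names build_offset_records and fetch_key are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: pack the keys into one blob plus a table of
# 4-byte offsets. The table has n+1 entries, so key i spans the
# offsets at index i and i+1.
sub build_offset_records {
    my @keys = @_;
    my $blob = join '', @keys;
    my $pos  = 0;
    my $offs = pack 'L*', 0, map { $pos += length $_ } @keys;
    return ($blob, $offs);
}

# Fetch key number $i: one read from the offset table (the extra
# indirection), then one read from the blob.
sub fetch_key {
    my ($blob, $offs, $i) = @_;
    my ($start, $end) = unpack 'L2', substr($offs, 4 * $i, 8);
    return substr($blob, $start, $end - $start);
}

my ($blob, $offs) = build_offset_records(qw(a bc defgh ij));
print fetch_key($blob, $offs, $_), "\n" for 0 .. 3;
```

The overhead is 4 bytes per key regardless of key length, at the cost of the second memory access on every fetch.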

$ perl -anE '$h[length$F[0]]++ }{ say "@h";' data
 52 2704 140608 7190883 35010259 35855576 28751420 19240344 10899542 5278104 2202814 795438 249757 68269 16155 3388 640 89 12  1

Then a small test. You can see I didn't bother mapping value indexes but used an order-preserving hash instead. Memory usage comes to about 26 bytes/pair; double that while building the hashes.

Edit: same test with updated code, built with a higher -O level.

[         1.303844] data ALLOCATED; tab = 160002048, ss = 14680064 (10000000 pairs)
[         5.478031] built BDZ for syms
[         2.873171] inplace REORDER
[        20.015568] built CHM for vals
[         0.000028] mph size when packed: syms = 3459398, vals = 83600028
[         0.522367] fgets loop; lines=10000000
[         1.195339] fgets+strtoul; lines=10000000
[         2.235220] SYMS fetch; found=10000000
[         2.940386] VALS fetch; found=10000000
[         2.709484] VRFY sym to val; matched=10000000
[         4.258673] VRFY two-way; matched=10000000
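As a sanity check on the "about 26 bytes/pair" figure, the sizes printed in the log can simply be added up. This is a small sketch that assumes tab, ss and the two packed hash sizes together make up the whole footprint, which is presumably what the figure refers to; the hash keys and the name bytes_per_pair are made up here.

```perl
use strict;
use warnings;

# Sizes in bytes, copied from the log. Treating their sum as the
# total footprint is an assumption of this sketch.
my %bytes = (
    tab      => 160002048,   # "tab" from the data ALLOCATED line
    ss       => 14680064,    # "ss" from the same line
    mph_syms => 3459398,     # packed hash for syms
    mph_vals => 83600028,    # packed hash for vals
);
my $pairs = 10_000_000;

# Average number of bytes spent per key/value pair.
sub bytes_per_pair {
    my ($sizes, $n) = @_;
    my $total = 0;
    $total += $_ for values %$sizes;
    return $total / $n;
}

printf "%.2f bytes/pair\n", bytes_per_pair(\%bytes, $pairs);
```

The four sizes sum to 261741538 bytes, i.e. a little over 26 bytes for each of the 10000000 pairs.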

~~Old output.~~

In reply to Re^2: Bidirectional lookup algorithm? (try perfect hashing) by oiskuu
in thread Bidirectional lookup algorithm? (Updated: further info.) by BrowserUk
