Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

If you find Berkeley too slow, or too memory expensive in practice, you might reconsider the md5 hashing suggestion. I've implemented this for my own purposes now, and the highlights are:

Indexing 100_000_000, 160 byte records in a data file.

  1. The index requires 2.23 GB for 100 Million records. Data record size does not affect the index size. It is a fixed 24-bytes/record.
  2. Building and sorting the index: 3 hours (worst case; under 1 hour best);
  3. Accessing 10_000 randomly chosen records: Under 3 minutes.

    That's locating in the index entry and reading the record combined.

    Worse timing: 1000 trials of binsearch ( 37.753s total), 37.753ms/trial

    Best timing: 10000 trials of binsearch ( 175.755s total), 17.576ms/trial

    Update: A 100,000 thrials just completed:

    100000 trials of binsearch ( 1,643s total), 16.437ms/trial

  4. Insert/delete* a record: Currently 1.2 seconds.

    This can be improved, I believe substantially.

    Insertion appends the new record to the end of the data file, and inserts the appropriate index entry.

    * Deletion consists of removing the index entry and adding it to a "deleted.idx" file.

    The actual record remains in-place until a compaction process is run. The latter is not part of the timing above.

The above is hitting the disk (or system caches) for every read. I have some ideas for adding my own buffering , but they need testing.

The test system was XP/NTFS on a 2.2 Ghz processor with a WDC WD800BB-75CAA0 75GB HD.

The datafile is compressed, the index not.

For contrast, I gave up waiting for Berkeley to build a BTree index from a pre-sorted file of index entries after 18+ hours and 57% complete. Hence, I don't have figures for access/insertions or deletions.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re^3: size on disk of tied hashes by BrowserUk
in thread size on disk of tied hashes by danderson

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (7)
As of 2024-04-23 19:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found