
Re: RFC: Abusing "virtual" memory

by jbert (Priest)
on Nov 27, 2007 at 11:03 UTC ( #653189 )

in reply to RFC: Abusing "virtual" memory

Of course, 'sort' has to either read everything in before producing any output (since the last line read could sort to the front), or do a bunch of seeking around and re-reading. Even if sort reads everything into memory, you probably still win, because the Perl scalars were bigger than the raw lines sort holds in memory.
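A minimal sketch of that approach: pipe the data through the external sort(1) and consume its output one line at a time, so Perl never holds the whole data set in (comparatively fat) scalars. This assumes a Unix-like system with sort on the PATH; the sample data is made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a few unsorted lines to a temporary file ...
my ($fh, $file) = tempfile();
print {$fh} "pear\napple\nmango\n";
close $fh;

# ... then let external sort(1) hand them back in order, one line at a
# time, instead of slurping everything into Perl data structures.
open my $sorted, '-|', 'sort', $file or die "cannot run sort: $!";
my @out;
while (my $line = <$sorted>) {
    chomp $line;
    push @out, $line;    # in real use: process and discard each line
}
close $sorted;
print "@out\n";          # apple mango pear
```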

But the real issue is: "Why use a disk-based hash store when you need to process the keys in sorted order?" (Do you need to process them in sorted order?)

If your keys are sequential, a simple fixed-length record file allows very good performance (you can add new keys to the end, and read a value with a single seek+read).
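For sequential keys, the fixed-length-record idea might look like this sketch (the record length of 16 bytes and the key/value layout are arbitrary choices for illustration): record N lives at byte offset N * RECLEN, so a lookup is one seek plus one read.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

my $RECLEN = 16;    # fixed record length, chosen for the example

my ($fh, $file) = tempfile();
binmode $fh;

# Append a value for sequential keys 0..2; record N lives at N * $RECLEN.
for my $key (0 .. 2) {
    print {$fh} pack "A$RECLEN", "value-$key";    # space-padded to $RECLEN
}

# Read back the value for key 1 with a single seek+read.
seek $fh, 1 * $RECLEN, SEEK_SET or die "seek: $!";
read $fh, my $buf, $RECLEN      or die "read: $!";
(my $value = $buf) =~ s/\s+\z//;    # strip the pack() padding
print "$value\n";                   # value-1
```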

If your keys are more complex, I'd bring in an external indexing engine in the form of a db such as SQLite (or mysql, or postgres, or...).
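As a sketch of the SQLite route (requires the non-core DBD::SQLite module; the table and column names here are invented for the example): let the database's index return the keys already sorted, rather than sorting in the application.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database for the sketch; use a file path for real data.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)');

my $ins = $dbh->prepare('INSERT INTO kv (k, v) VALUES (?, ?)');
$ins->execute(@$_) for ['banana', 2], ['apple', 1], ['cherry', 3];

# The PRIMARY KEY index hands the keys back in sorted order.
my $rows = $dbh->selectall_arrayref('SELECT k, v FROM kv ORDER BY k');
print "$_->[0]=$_->[1]\n" for @$rows;   # apple=1 banana=2 cherry=3
```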

Replies are listed 'Best First'.
Re^2: RFC: Abusing "virtual" memory
by shmem (Chancellor) on Nov 27, 2007 at 16:22 UTC
    But the real issue is: "Why use a disk-based hash store when you need to process the keys in sorted order?" (Do you need to process them in sorted order?)

    For that DB_File provides a disk-based hash store with sorted keys - DB_BTREE.
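    A minimal sketch of that (requires DB_File built against Berkeley DB; the file name is a throwaway temp file): tie a hash to an on-disk B-tree, and iteration order becomes key order, not insertion order.

    ```perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);
    use File::Temp qw(tempfile);

    my (undef, $file) = tempfile();
    unlink $file;    # let DB_File create the file fresh

    # Tie a hash to an on-disk B-tree: a disk-based hash store with sorted keys.
    tie my %h, 'DB_File', $file, O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "cannot tie $file: $!";

    @h{qw(pear apple mango)} = (3, 1, 2);

    # keys() walks the B-tree in key order, not insertion order.
    my @sorted = keys %h;
    print "@sorted\n";    # apple mango pear

    untie %h;
    unlink $file;
    ```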

    update: added link


      Ah, cool, thanks.

      In which case the question is more simple: "why sort your keys in the app when you can get the db to do it for you?" :-)

        “Why sort your keys externally when you can get the DB to do it for you?”


        To answer that question, I give you today's workload of 11,344,209 telephone call records. You have exactly 4 hours of wall time to process them. If you imagine that you have enough time to put all those records into a B-tree-indexed random file, I have a bridge to sell you. To solve this problem, and to do so consistently on a daily basis, you must get through the same workload -- with utter reliability and consistency -- much faster.

        It may come as an utter and complete shock to you to fathom that your grandfathers, armed with nothing more than punched-card tabulators and sorters, with nary a digital computer in sight, could do that. Every day. Under wartime conditions.

Re^2: RFC: Abusing "virtual" memory
by sundialsvc4 (Abbot) on Nov 28, 2007 at 04:40 UTC

    Good sir or madam, I wouldst delicately and most-diplomatically point out that your intuitive suppositions are, in fact, not correct ... and that it is precisely my key point that, counter-intuitive though it may be, they are not.

    For at least a generation, our predecessors made-do without computers of any kind... they used only punched cards, and with those primitive tools they accomplished miracles.

    For another full generation beyond that, computers existed, yes, but copious memory sizes did not. “A million bytes of memory?! Gosh!”

    It is, indeed, precisely(!) my point that our grandfathers with their punched-card machines had, and still have, a valuable lesson to teach us.

      Sorry, I have no idea if your two replies to me in this thread are as a result of my having offended you or not. If you're offended I'm sorry, that wasn't my intent.

      I'm probably being dense, but as far as I could see, neither of your responses answered the question:

      "If you need to process the keys in sorted order - and the volume of data of the keys is significant - why are you storing them in a data structure which does not keep them in a sorted order?"

      Your oblique references to performance perhaps suggest that you think the cost of updating the index on insertion will be too high to meet your throughput needs.

      Is this right?

      If there's some other meaning, I'm afraid I've missed it. Perhaps you could rephrase it?

      Update: is this all a joke, btw? 44Mb of RAM isn't a significant amount out of the buffer cache of "a very expansive computer, plenty of memory, fast CPU". Or is this all on a historical machine? Or is your point something about human versus computer computations?
