The Upper Limit of Perl's Native Data Structures

by arunhorne (Pilgrim)
on May 21, 2003 at 12:23 UTC

arunhorne has asked for the wisdom of the Perl Monks concerning the following question:

Monks,

I have just been in correspondence with Jarkko Hietaniemi, the author of the excellent Graph::Directed module, about using it to store graphs with ~3000 nodes and ~1.5 million edges. He believes that his module will not scale well to graphs of this size, and wonders if the native Perl data structures will begin to "creak". So I have two questions off the back of this.

1. Seeing as Graph::Directed is backed by hashes, what is a realistic maximum size for a hash or an array in Perl, given factors such as the hashing function, local memory, etc.?

2. As I insert this large graph into the Graph::Directed module I note what appears to be an exponential slowdown as I insert edges. Clearly this relates to the data structure backing the Graph object, but does anyone know the big-O complexity of inserting edges and nodes into the data structures used by Graph::Directed?

I look forward to your replies.

____________
Arun

Re: The Upper Limit of Perl's Native Data Structures
by Joost (Canon) on May 21, 2003 at 12:50 UTC

    Theory:

    A hash in perl uses at least 32 bits for the bucket index, which translates to 2Gig buckets, so a perfectly filled hash could contain 2Gig elements (if not perfectly filled, it could hold a lot more or a lot fewer). Also, depending on your machine, it could use more bits for the index.

    Practice:

    This code:
    #!/usr/bin/perl -w
    use strict;

    $| = 1;

    # Build 3000 nodes.
    my @nodes;
    for (1 .. 3000) {
        push @nodes, { number => $_ };
        print "\r$_" unless $_ % 1000;
    }

    # Build 2.5 million edges between them.
    my @edges;
    for (0 .. 2_500_000) {
        push @edges, {
            s => \$nodes[$_ % 3000],
            d => \$nodes[2999 - $_ % 3000],   # node indices run 0 .. 2999
        };
        print "\r$_" unless $_ % 1000;
    }

    print "\ndone, press enter to exit";
    my $dummy = <STDIN>;
    Eats up about 430 MB on my machine (perl v5.8.0 built for i386-linux-thread-multi).

    It's not super-fast, but it still works. However, how much memory Graph::Directed objects will take, and how long operations on the graphs will take is something you'll have to test for yourself.

    Joost.

    -- #!/usr/bin/perl -np BEGIN{@ARGV=$0}s(^([^=].*)|=)()s; =Just another perl hacker\
Re: The Upper Limit of Perl's Native Data Structures
by Abigail-II (Bishop) on May 21, 2003 at 13:12 UTC
    Seeing as Graph::Directed is backed by hashes, what is a realistic maximum size for a hash or an array in Perl, given factors such as the hashing function, local memory, etc.?

    What do you mean by a "realistic maximum size"? The time to insert into a hash is fairly independent of the size of the hash, unless you are really unlucky and items hash to the same buckets. Every now and then, a single insert will take more time because the C-level array needs to be expanded and items rearranged, but this amortizes out over previous inserts. However, were you to plot how long it takes to fill an N-element hash, you wouldn't get a fairly straight line; you'd get a broken line. That's because at some point, your program will begin to swap. And where swapping begins is highly dependent on your system.

    As I insert this large graph into the Graph::Directed module I note what appears to be an exponential slowdown as I insert edges.
    I can't combine an exponential slowdown with Jarkko inserting 1.5 million edges. He isn't that old.

    Abigail

      I can't combine an exponential slowdown with Jarkko inserting 1.5 million edges. He isn't that old.
      ROTFL! Hilarious.

      Makeshifts last the longest.

      I created a small program that inserts N different values into a hash, and timed how long it took for several values of N. You can see the plot at http://perl.abigail.nl/Images/hash_insert.png. It plots the average insert time against the total size.
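
      A minimal sketch of such a timing harness, assuming Time::HiRes for the clock (a reconstruction for illustration; the actual program behind the plot isn't shown):

      use strict;
      use warnings;
      use Time::HiRes qw(gettimeofday tv_interval);

      # Time inserting $n keys into a fresh hash, for growing $n.
      for my $n (map { $_ * 250_000 } 1 .. 12) {
          my %hash;
          my $t0 = [gettimeofday];
          $hash{$_} = 1 for 1 .. $n;
          my $elapsed = tv_interval($t0);
          printf "%10d keys: %8.3f s total, %.2e s/insert\n",
              $n, $elapsed, $elapsed / $n;
      }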

      The curve skyrockets when the hash is taking enough memory that swapping is needed.

      Abigail

Re: The Upper Limit of Perl's Native Data Structures (RAM)
by tye (Sage) on May 21, 2003 at 16:00 UTC

    You are probably simply running out of RAM. As others have discussed, Perl hashes should usually be well behaved even when they get really huge. However, it is my experience that running out of RAM means that your code slows down exponentially as its need for RAM grows.

    One exception is if you can localize RAM access so that items you access at nearby times are also nearby in RAM (that is, clumped together on a few pages of VM rather than each on a different page of VM). For example, looping over 2-D arrays in FORTRAN or C row-by-row can be much faster than column-by-column; the access pattern is sketched below.
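
    A sketch of the two access patterns in Perl, using a flat 1-D array indexed as row*cols+col (the dimensions are made up; as the next paragraph notes, Perl's pointer-heavy internals blunt the effect you would see in C):

    use strict;
    use warnings;

    my ($rows, $cols) = (1_000, 1_000);
    my @flat = (0) x ($rows * $cols);   # emulate a 2-D array in one block
    my $sum = 0;

    # Row-by-row: consecutive indices, so nearby accesses share VM pages.
    for my $r (0 .. $rows - 1) {
        $sum += $flat[$r * $cols + $_] for 0 .. $cols - 1;
    }

    # Column-by-column: each access strides $cols elements past the last.
    for my $c (0 .. $cols - 1) {
        $sum += $flat[$_ * $cols + $c] for 0 .. $rows - 1;
    }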

    But Perl makes heavy use of pointers and so really large Perl arrays don't usually lead to very localized memory use even when you stick to one part of the array. And really large Perl hashes jump all over in RAM.

    You could try tying the hash to a DBM file. This will make the initial hash accesses quite a bit slower, but will prevent the exponential slow-down as the hash gets really large, and so may well make your code run much faster in this specific case.
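
    A minimal sketch of that approach with DB_File (the filename and key format here are made up for illustration):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # The hash data lives on disk instead of in RAM.
    tie my %edge, 'DB_File', 'graph.db', O_RDWR | O_CREAT, 0666, $DB_HASH
        or die "Cannot tie graph.db: $!";

    $edge{'nodeA,nodeB'} = 1;                       # store an edge
    print "present\n" if exists $edge{'nodeA,nodeB'};

    untie %edge;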

                    - tye
Re: The Upper Limit of Perl's Native Data Structures
by Elian (Parson) on May 21, 2003 at 15:08 UTC
    Part of your slowdown may come from the repeated reallocation of the base data structure for the hash. There's an array in there that holds the elements of the hash--when it's full, perl has to allocate a new, larger chunk and copy the data from the old chunk to the new one. (This also tends to fragment memory.) You can use keys in an lvalue context (keys %foo = 1000000) to presize the hash, which may help some; see the sketch below.
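
    A minimal sketch of presizing (the hash name is illustrative):

    my %edge_of;                         # hypothetical hash for edge data
    keys(%edge_of) = 1_000_000;          # allocate the buckets up front
    $edge_of{$_} = 1 for 1 .. 500_000;   # inserts now avoid repeated resizing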

    Depending on how well your keys hash out, you may find you're badly overloading one of the buckets in the hash structure as well. While perl tries to make sure the keys are evenly distributed by the hashing function, it doesn't always succeed, and in that case there's not a whole lot you can do. Printing the hash in scalar context will give you a little bit of info about that--if the two numbers returned are wildly different, you've got a badly mis-balanced hash. (In which case you're kinda out of luck.)
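
    A sketch of that check (this is the behaviour of perls of this vintage; since 5.26, a hash in scalar context returns the key count instead):

    my %h = map { $_ => 1 } 1 .. 1_000;
    print scalar(%h), "\n";   # prints something like "639/1024":
                              # used buckets / allocated buckets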

Re: The Upper Limit of Perl's Native Data Structures
by no_slogan (Deacon) on May 21, 2003 at 18:24 UTC
    ...graphs with ~3000 nodes and ~1.5million edges.

    Building a graph with Graph::Directed doesn't put all the edges into one big hash. It creates an adjacency list for each vertex, so the size of each individual hash should only be about equal to the number of vertices. A 3000-element hash should be no problem, unless you're looking for trouble.
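
    The general shape is a hash of hashes, one small hash per vertex (a sketch with illustrative names, not Graph::Directed's actual internals):

    my %succ;                            # $succ{$from}{$to} = 1
    sub add_edge { $succ{ $_[0] }{ $_[1] } = 1 }

    add_edge('a', 'b');
    add_edge('a', 'c');
    print join(' ', sort keys %{ $succ{'a'} }), "\n";   # b c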

    As I insert this large graph into the Graph::Directed module I note what appears to be an exponential slowdown as I insert edges.

    I don't see a slowdown until the graph gets big enough to fill available RAM and start swapping, so it sounds like tye is right. Graph::Directed's graph representation is designed to be flexible rather than compact. To handle big graphs, you might need to use a different representation; one possibility is sketched below.
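
    For instance, you could number the vertices 0 .. N-1 and pack each adjacency list into a single string of 32-bit integers instead of keeping a Perl hash per vertex (a sketch only; it trades flexibility for memory):

    my @adj = ('') x 3_000;                     # one packed string per vertex
    sub add_edge { $adj[ $_[0] ] .= pack 'N', $_[1] }

    add_edge(0, 42);
    add_edge(0, 7);
    my @successors = unpack 'N*', $adj[0];      # (42, 7)
    print "@successors\n";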

    does anyone know the big-O complexity of inserting edges and nodes into the data structures used by Graph::Directed

    It looks like it should be close enough to O(n) as to make no difference.

    Aside to Abigail: Ha ha, but exp($n/1_000_000) is still exponential.

      Aside to Abigail: Ha ha, but exp($n/1_000_000) is still exponential.

      And you think the OP noticed that?

      Abigail

        And you think the OP noticed that?

        Yes. It sounds to me like he ran a series of tests that looked something like this:

            edges  time
        ---------  -----
          500_000  1 min
        1_000_000  2 min
        1_500_000  4 min

        And he recognized that as an exponential progression. In any case, there's no cause for making snide remarks about what he did or didn't notice.
