PerlMonks  

threads::shared seems to kill performance

by Jacobs (Novice)
on Jul 17, 2013 at 21:09 UTC ( [id://1044903] )

Jacobs has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise monks, I'm trying to increase performance of my program (which uses a huge 3-dimensional hash - around 240MB in RAM) by moving it to a threaded model.

I thought I'd explicitly share the big hash and then access the data from each thread (the operations I perform are read-only, so I'm not worried about individual threads conflicting).

However the process of creating the big shared hash takes ages for some reason - I've tried these two simplified versions for comparison:

    my %data;
    foreach my $x (1..5000) {
        $data{$x} = {} unless $data{$x};
        foreach my $y (1..1000) {
            $data{$x}{$y} = {} unless $data{$x}{$y};
        }
    }

    real    0m4.075s
    user    0m3.767s
    sys     0m0.289s
vs the shared one:
    use threads;
    use threads::shared;

    my %data :shared;
    foreach my $x (1..5000) {
        $data{$x} = &share( {} ) unless $data{$x};
        foreach my $y (1..1000) {
            $data{$x}{$y} = &share( {} ) unless $data{$x}{$y};
        }
    }

    real    1m4.984s   # that's ~16x slower than the non-shared case!
    user    1m4.211s
    sys     0m0.540s

Is there something wrong with my code or is this performance decrease simply an inevitable cost of sharing?

Replies are listed 'Best First'.
Re: threads::shared seems to kill performance
by dave_the_m (Monsignor) on Jul 17, 2013 at 22:42 UTC
    threads::shared variables are *very* slow; you should share as little as possible, and access what you do share sparingly.

    The implementation essentially does something similar to tying (except that it's implemented in XS rather than Perl); so

        my %hash : shared;
        ...
        $x = $hash{foo};
    is a bit like
        sub threads::shared::FETCH {
            lock $Some::Global::lock_var;
            return $Some::Shared::Space::hash{ $_[0] };
        }

        my %hash;
        tie %hash, 'threads::shared';
        ...
    Note that each thread has its own copy of the 'tied' hash; accessing it causes a global lock to be set, then an entry from the 'real' hash is copied to that thread.
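    The per-access overhead described above is easy to see in a quick timing sketch (a minimal, self-contained comparison; the values are plain scalars, so no &share is needed for them, and the exact ratio will vary by perl build):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
use Time::HiRes qw(time);

# Fill a plain hash and a shared hash with the same 10,000 entries.
my %plain = map { $_ => $_ * 2 } 1 .. 10_000;
my %shared : shared;
%shared = map { $_ => $_ * 2 } 1 .. 10_000;

# Time reading every entry of the plain hash...
my $t0 = time;
my $sum_plain = 0;
$sum_plain += $plain{$_} for 1 .. 10_000;
my $t_plain = time - $t0;

# ...and of the shared hash (each read goes through the tie-like FETCH).
$t0 = time;
my $sum_shared = 0;
$sum_shared += $shared{$_} for 1 .. 10_000;
my $t_shared = time - $t0;

printf "plain: %.5fs  shared: %.5fs\n", $t_plain, $t_shared;
```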

    Dave.

Re: threads::shared seems to kill performance
by BrowserUk (Patriarch) on Jul 18, 2013 at 00:33 UTC

    Yes, shared aggregates are considerably slower than non-shared.

    But try it this way and it'll be about 2/3rds less slow:

    use threads; use threads::shared; my %hashOf1000SharedHashes = map{ $_ => &share({}) } 1 .. 1000; my %data:shared; foreach my $x (1..5000) { $data{$x} = shared_clone( \%hashOf1000SharedHashes ); } undef %hashOf1000SharedHashes;

    That said, building a 2D HoH of empty hashes (with consecutive numerical indices?) doesn't seem very useful.

    Presumably that structure will need to be populated at some point -- and with that amount of data it must be coming in from outside the program -- and once you add the IO needed to fetch the data into the mix, the cost of making the data shared will pale into insignificance.

    If, instead of building a huge, empty shared data structure and then populating it (which will take considerable further time), you shared and populated it in one pass, you'd save considerable time, and the sharing costs would almost disappear amongst the IO costs.
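    A minimal sketch of that one-pass approach (fetch_rows() here is a stand-in for whatever really supplies (owner, date, value) tuples -- a DB query, a file, etc. -- and the sample rows are invented):

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my %data : shared;

sub fetch_rows {
    # placeholder for the real data source
    return ( [ 'alice', '2013-07-17', 42 ],
             [ 'alice', '2013-07-18', 7  ],
             [ 'bob',   '2013-07-17', 99 ] );
}

for my $row ( fetch_rows() ) {
    my ( $owner, $date, $value ) = @$row;
    # create each inner level only when it is first seen,
    # rather than pre-allocating millions of empty shared hashes
    $data{$owner}        //= &share( {} );
    $data{$owner}{$date} //= &share( [] );
    push @{ $data{$owner}{$date} }, $value;
}
```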

    Tell us more about what goes in this monster, where that comes from; and how it is used and we'll probably be able to help you save a lot of time.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hello BrowserUK, threading master of masters from what I hear! Thank you for your response.

      I'm aware I'm probably breaking several laws and killing small kittens in the process by allocating a hash this big.

      Originally the data comes from a SQLite database. There's one huge table that's keyed by 2 levels of parameters - say: owner, date, some_data (with <owner,date> being unique and the set of owners relatively small) - and by loading this into those hashes, I'm trying to introduce some structure to the data so that I can later access it from my program in a way I can easily understand and work with ($data{user}{date}[]).

      Strangely, loading the data from the database doesn't have as big an impact on performance as the sharing does. In my real-life tests - where I do in fact initialize the hash and populate it in one pass, as you suggest - loading from the DB and populating the hash (with a significantly reduced set of data) took about 2s. Once I added the sharing (in a way similar to my example above), it took about 26s.

        Originally the data comes from a SQLite database....

        Then I very strongly advise against taking the data out of the db and putting it into a hash.

        Not only will doing so take considerable time and substantial space; although for read-only use you won't need any locking of your own, there is no way to turn off the locking Perl uses to protect its internals, and that will bring your application to a crawl.

        Instead, share the db handle and create statement handles for your queries. Whilst I haven't done this personally (yet), according to this, the default 'serialized' mode of operation means that you don't even need to do user locking as the DB will take care of that for you.

        If you create/clone your DB as an in-memory DB, after you've spawned your threads; then you will avoid the duplication of that DB and the performance should be on a par with, and potentially faster than a shared hash.

        When I get time, which may not be soon, I intend to test this scenario for myself as I think it might be a good solution to sharing large amounts of data between threads. Something Perl definitely needs.

        It may even be possible to wrap it in a tied hash to simplify the programmer's view of the DB without incurring the high overheads of threads::shared (that's very speculative!).

        In any case, as your data is already in a DB, don't take it out and put it in shared hashes. That just doesn't make sense. Just load it into memory after your threads are spawned; and then set the dbh into a shared variable where the threads can get access to it.

        At least, that is what I would (and will) try.
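        A conservative sketch of the DB-side approach (with one caveat: a DBI handle itself generally can't be stored in a threads::shared variable, so this version has each thread open its own connection to the same SQLite file instead; the file, table, and column names are invented for illustration, and DBD::SQLite is assumed to be installed):

```perl
use strict;
use warnings;
use threads;
use DBI;

# Set up a small demo database (stands in for the real SQLite file).
my $dbfile = 'demo_threads.db';
{
    my $dbh = DBI->connect( "dbi:SQLite:dbname=$dbfile", '', '',
                            { RaiseError => 1 } );
    $dbh->do('CREATE TABLE IF NOT EXISTS big_table'
           . ' (owner TEXT, date TEXT, some_data INTEGER)');
    $dbh->do('DELETE FROM big_table');
    $dbh->do(q{INSERT INTO big_table VALUES ('alice','2013-07-17',42)});
    $dbh->disconnect;
}

# Each worker thread opens its own connection and queries independently;
# read-only queries need no user-level locking.
my @workers = map {
    threads->create( sub {
        my $dbh = DBI->connect( "dbi:SQLite:dbname=$dbfile", '', '',
                                { RaiseError => 1 } );
        my ($value) = $dbh->selectrow_array(
            'SELECT some_data FROM big_table WHERE owner = ? AND date = ?',
            undef, 'alice', '2013-07-17' );
        $dbh->disconnect;
        return $value;
    } );
} 1 .. 4;

my @results = map { $_->join } @workers;
unlink $dbfile;
```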


Re: threads::shared seems to kill performance
by Preceptor (Deacon) on Jul 17, 2013 at 21:44 UTC

    Hmm, well, I'd sort of expect 5,000,000 &share calls to take a reasonable amount of time, yes. Hashes - particularly multidimensional ones - don't work well with threads::shared. What you've got is essentially a fudge that creates a lot of separate anonymous hashes and links them together.

    However, if - as you say - your data is read-only from your threads, you might not need to do that. If you initialise it prior to spawning your threads, each thread will take a copy of your global namespace anyway. You just won't be able to modify it within a thread (or technically you can, but the change won't propagate to other threads).
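    A minimal sketch of that suggestion: the structure is built before threads->create, so each thread gets its own private copy at creation time (the data here is just an invented placeholder):

```perl
use strict;
use warnings;
use threads;

# Build the structure BEFORE spawning any threads.
my %data;
for my $x ( 1 .. 10 ) {
    $data{$x}{$_} = $x * $_ for 1 .. 10;
}

# Each thread receives a snapshot of %data taken at creation time;
# reads work, but writes would not be seen by other threads.
my @threads = map {
    threads->create( sub {
        my $id = shift;
        return $data{$id}{5};
    }, $_ );
} 1 .. 3;

my @got = map { $_->join } @threads;
# @got is (5, 10, 15)
```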

      Thank you. I considered not sharing, but that would effectively mean each thread would be set up with a copy of the original 240MB structure, would it not?

      I was afraid this would quickly kill my memory, but thinking about it now, isn't there a chance this would be copy-on-write only? And thus even 1000 threads would (considering I only do reads) still only use 240MB of memory?

        Couldn't say myself without trying it. I know some modes of parallel processing use copy-on-write memory, and others don't. I'm pretty sure Unix 'fork' does, for example. I've never had occasion to check whether threads do too.

        It may not be viable, but depending on how frequently you read the hash, you might find you can have a 'handler' thread that services requests for data from the hash.
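        That 'handler' thread idea can be sketched with Thread::Queue: only the handler holds the (non-shared) hash, and other threads ask for values over queues. This is a minimal, invented example, not the poster's code:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $req = Thread::Queue->new;   # keys to look up
my $res = Thread::Queue->new;   # answers back

my $handler = threads->create( sub {
    # Only the handler thread holds the big hash; it never needs sharing.
    my %data = map { $_ => $_ * $_ } 1 .. 100;
    while ( defined( my $key = $req->dequeue ) ) {
        $res->enqueue( $data{$key} );
    }
} );

# A requester asks for two keys and reads the replies in order.
$req->enqueue($_) for 3, 7;
my $nine      = $res->dequeue;   # 9
my $fortynine = $res->dequeue;   # 49

$req->enqueue(undef);            # tell the handler to finish
$handler->join;
```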

        Otherwise - your code is all about initially creating the hash. How does it perform once that's finished? It may be worth the overhead.
