Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Anyone with XS experience willing to create a high performance data type for Perl?

by beautyfulman (Sexton)
on Nov 10, 2021 at 00:56 UTC ( #11138654=perlquestion: print w/replies, xml ) Need Help??

beautyfulman has asked for the wisdom of the Perl Monks concerning the following question:

I was looking specifically for a Red and Black Tree module on CPAN but as i was somewhat flexible I benchmarked all data structures i could find on CPAN and Perl's built-in data types. They are super slow with the exception of Tree::RB and DB_File in-memory DB TREE mode which are very fast.

Ignoring the super slow stuff here are the results for inserting two million key-value pairs and then finding the max and min value key in the data structure. As i compared different languages i opted to use linux's time function instead of specialized in-language benchmark, not the best option but gives you a picture. Set up for the test is Linux under a VMPlayer virtual machine i7-8700 reserved for the VM were 3 cores and 6 threads.

Perl module Tree::RB time 33 seconds.
Perl module DB_File with the above specifications and File::Utils to find min and max 19 seconds.
Edit because some people asked: Perl built-in hash: 4 seconds
Julia built-in Dict data type 2 seconds
Ruby gem rbtree 2 seconds
Last edit ever: Perl module Tree::RB::XS 1 second Code used for Perl XS and Ruby rbtree https://pastebin.com/z75dSzFM

For Perl kind of performance Tree::RB is ultra fast but look at Ruby's standard red and black tree, it is a lot faster. The Ruby gem rbtree is just C code bind for dict.c, you can download it here https://rubygems.org/gems/rbtree. Inside the gem file there is dict.c and dict.h, as well as ruby.c which is the wrap.

I wonder if someone specialized in XS would be willing to create the same Ruby wrap for Perl? XS is hard and i have zero experience with it.

The reasonings are many: Apache Kafka message broker can produce more than one million messages per second, Perl's text processing can get much faster, Data Science performance and so on.

Edit: NERDVANA created the XS version Tree::RB::XS

  • Comment on Anyone with XS experience willing to create a high performance data type for Perl?

Replies are listed 'Best First'.
Re: Anyone with XS experience willing to create a high performance data type for Perl?
by talexb (Chancellor) on Nov 10, 2021 at 01:38 UTC
      For Perl kind of performance Tree:RB is ultra fast but look at Ruby's standard red and black tree, it is several orders of magnitude faster. The Ruby gem rbtree is just C code bind for dict.c, you can download it here https://rubygems.org/gems/rbtree Inside the gem file there is dict.c and dict.h, as well as ruby.c which is the wrap.

    If Tree::RB takes 33 seconds and the Ruby gem rbtree takes 2 seconds, then it's an exaggeration to say it's 'several orders of magnitude faster'; based on those numbers, it's one order of magnitude faster.

    And if the Ruby solution is just C code that you can download, how about just using Inline::C to wrap that code and use it as is.

    However, I have to ask my favourite question, which is "What problem are you trying to solve?" Why are you looking for something that sorts stuff? Is there any way you could put the data you have into a database, and have it deal with whatever operation you're trying to do?

    The beauty of using a database is that lots and lots of hard work has already been done on handling millions of rows efficiently; handling filtering, sorting and selection of those rows is what a database does. You can also have multiple tasks that add data, grab chunks of it, and delete stuff that's been processed.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      > then it's an exaggeration to say it's 'several orders of magnitude faster'; based on those numbers, it's one order of magnitude faster.

      I used to confuse orders of magnitude with times for a long time, until I realized the latter former referred to the exponent (usually applied to 10). Nonetheless, the type of hubris in this post begs for master of these kinds of topics. Perhaps OP is willing to contract someone to do this, which I think would be the appropriate tact. That said, the initial step of using Inline::C you suggested I think is the right one. For additional speed up, if the algorithm is parallelizable, is to introduce OpenMP, which is a lot more straightforward with Alien::OpenMP (and OpenMP::Environment for good measure).

      Updated thanks to note by jdporter.

        An order of magnitude refers to the difference in exponent (or that's what I learned). To take an extreme case, 10 and 99 (1E1 and 9.9E1) are the same order of magnitude, but 99 and 100 (9.9E1 and 1E2) are different orders of magnitude.

        And sure, a 10x difference is big, but all of these things are relative. Unless you're working at FAANG scale, this may not add up to much. So your job runs for 10 seconds instead of a second? Meh. OK. Ten minutes instead of a minute? OK. Ten hours instead of one hour? Hmm .. that does take a while. Maybe don't run it every hour.

        Alex / talexb / Toronto

        Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Ok for the "orders of magnitude" but that is a huge difference.
      In my case the data type is to consume Kafka messages and process it in real time. A Red Black Tree is the most often used data type for that, a file based DB is not suitable for that.

        OK -- we have a tiny bit more information, but I still don't know why a database isn't a suitable solution for whatever problem you're trying to solve. Like I said, database developers have spent decades making sure their platform is stable and efficient. Perhaps I should ask this a different way -- is there a reason why a database wouldn't work for you? It's still not clear what the problem is.

        Alex / talexb / Toronto

        Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

Re: Anyone with XS experience willing to create a high performance data type for Perl?
by dave_the_m (Monsignor) on Nov 10, 2021 at 11:32 UTC
    What are your actual requirements? Do you need a balanced tree, or on-disk storage or what? For example could you just use a perl hash and track min/max key values during insertion? Or just from time to time process a sorted list of the hash keys? The following trivial code on my laptop takes 1.2s to create the hash and a further 1.2s to find the current smallest and largest keys.
    my %data; for (1..2_000_000) { $data{$_} = "some random text $_"; } my ($min, $max) = (sort { $a <=> $b } keys %data)[0,-1];

    Dave.

      Ideally need a balanced tree like Red Black or AVL, the Perl hash takes 4 seconds in that VM setup. I used insert and min max only to have a raw idea , some kind of operations are appropriate for trees but not hashes.
        So again, tell us what you're trying to achieve.

        It is exceedingly unlikely that the outcome of this thread is that someone will take it upon themselves to put a fast btree implementation on CPAN which meets all your requirements. What might reasonably be an outcome is that someone will advise you of a way of achieving your requirement in a reasonably speedy way, e.g. by using CPAN module X or perl technique Y or interesting algorithm Z. But only if we know what you want.

        Dave.

        > Ideally need a balanced tree like Red Black or AVL, the Perl hash takes 4 seconds

        Not surprisingly, the Perl hash, at 4 seconds, is a lot faster than the (pure Perl) Tree::RB, at 33 seconds, when running your (unpublished) benchmark.

        Is there any functional reason why you can't just use a hash?

        Since you mentioned AVL, have you tried AVLTree from CPAN? It uses a XS wrapper around Julienne Walker's AVL Tree C library, so should be a lot faster than Tree::RB.

        References

        Disclaimer: I have no experience using any of these, just did a quick google for you.

        It is exceedingly unlikely that the outcome of this thread is that someone will take it upon themselves to put a fast btree implementation on CPAN which meets all your requirements.

        Seriously! I mean you'd need someone with XS expertise who also happens to know about Red/Black trees, and who reads this forum. And unless you're paying them it would need to be a weird masochist who actually enjoys XS, or maybe someone who finds systems-level work cathartic after a long multi-week stretch of web-app tedium. Or maybe if they just really love the Red/Black algorithm and had their own Red/Black implementation in C that they've been porting from project to project since college. Maybe if they had finished a big refactor of the code a few months ago but hadn't gotten to use it in a project yet, that could be enough motivation.

        Congratulations, the stars aligned.

        Here's your module. Tree-RB-XS-0.00_01.tar.gz
        (not visible on metacpan.org yet)

        You get the distinct privilege of reporting the first bugs in a recently-refactored pointer-math-heavy C library wrapped with some creative new XS ideas.

        Update: I added a new KEY_TYPE_BSTR that copies the keys from Perl into plain buffers, and uses memcmp on them. It's a good percentage faster at the expense of incorrect unicode sorting. It's useful if your strings are ASCII.

        Update: I finished off most of the Tree::RB API, polished up the documentation, and gave it an official release. It also now has custom XS compare functions to choose from, such as CMP_NUMSPLIT, which can handle the fairly common scenario of a mixture of numbers and strings in the key.

Re: Anyone with XS experience willing to create a high performance data type for Perl?
by eyepopslikeamosquito (Bishop) on Nov 10, 2021 at 02:52 UTC

    It seems to me you are leaping to conclusions. This sort of thing happens so often that I now have a node explaining my point of view:

      One kind of bottleneck you do not mention is money. If X runs 15 times slower than Y and you are paying fo r a 1500ghz CPU to run X, it is like ordering a 1500ghz CPU but instead receiving a 100mhz CPU for the same price.

        In all software companies I've worked at, human costs dwarf hardware costs. Even twenty years ago, I remember being told that it's cheaper to buy a new hard disk than to hold a meeting to discuss if we really need one or not. :)

Re: Anyone with XS experience willing to create a high performance data type for Perl?
by swl (Vicar) on Nov 10, 2021 at 04:53 UTC

    It's worth looking at FFI::Platypus if you want to develop bindings and avoid XS.

Re: Anyone with XS experience willing to create a high performance data type for Perl?
by eyepopslikeamosquito (Bishop) on Nov 10, 2021 at 06:53 UTC

    I benchmarked all data structures i could find on CPAN and Perl's built-in data types
    Can we see/run your benchmark code? How did the performance differ between RB trees (written in pure Perl?) vs Perl's built-in hashes?

Re: Anyone with XS experience willing to create a high performance data type for Perl? (Updated)
by bliako (Monsignor) on Nov 11, 2021 at 14:27 UTC

    I have created a test perl script which uses Inline::C and a heavily modified version of dict.c you have linked. Changes were necessary to make Inline::C understand the code. For example it seems to me that it dislikes function pointers inside struct. So I have set all function pointers to void*. One should typecast when necessary. It also seemed to me it had problems with enums within struct. I have replaced them with #define and integers.

    beautyfulman, it is now your call. Provide C code to test the functionality within C of dict.c. For example: create dictionary, add items, delete items, find items.

    EDIT: Important: in order for this code to work, I have made a change to Inline/Struct.pm. The original code I posted here needs that change in order to "work". Because I don't have test code I do not know whether my change will work or not, I know it compiles the code. That's all. So, I am now modifying the code below in order not to rely on Inline::Struct. The new code below will work with Inline::C (I have put comments where I did the changes).

    bw, bliako

      If there are shortcomings with Inline::Struct, please file an issue so they have a chance of being fixed :-)

        I will, thanks. The above was just a quick hack to get us going.

        Just for the record, the two problems I encountered (and could very much be red herrings) are that Inline::Struct failed when a C struct contained (EDIT: added next word:) typedef function pointers, I believe it did not understand the arcane C syntax (I love it though) : int (*funcpoi)(int, int); . It also did not like typedef enums within a struct. The above is just for the record, I will file a proper report at the channel you mentioned.

        bw, bliako

Re: Anyone with XS experience willing to create a high performance data type for Perl?
by bliako (Monsignor) on Nov 10, 2021 at 09:38 UTC
    #include <ruby.h>

    have you really thought that properly?

    (for the record: https://github.com/skade/rbtree/blob/master/dict.c )

    Unless you can provide a stand-alone rbtree.c, I don't find adding all ruby dependencies a good idea. When you eventually have such C code, use Inline::C to prototype it. The difficult part would be to provide the perl-vars into the C functions. There's documentation and a lot of examples though, so just a matter of time and effort. I am sure you will have something going for scalars at least in short time. And take it from there. See also: Re^3: XS, raspberry pi, and a hundred bucks on how to use Inline::C to bootstrap an XS project. Please post here again when you have something more concrete, and/or msg me.

    bw, bliako

      I think the OP should first show some reproducible benchmarks before jumping to action.

      There have been some new monks lately making bold demands without providing details.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        There is nothing bold. A red and black tree is always faster .

        Insert, Delete, iterating from a node to the next, search, asc and desc ordering. Do it as fast in Perl as it is with Ruby/C rbtree, i was unable to.

      Only for notice the link you posted is not the same gem file and maybe not the same dict.c, i never tested that one specifically

        You are right. In the link I posted, dict.c depends on ruby.h but the link you posted does not, the dependency for ruby.h is now in rbtree.c. (I posted that link as I was lazy, to avoid finding out how to extract source code from gem files). Do you have test code to demonstrate use of dict operations implemented in dict.c?

Re: Anyone with XS experience willing to create a high performance data type for Perl?
by syphilis (Archbishop) on Nov 11, 2021 at 00:33 UTC
    Inside the gem file there is dict.c and dict.h, as well as ruby.c which is the wrap.

    What's the easiest way for me to obtain the contents of dict.c and dict.h (in human readable form) from the gem file ?
    I don't have any Ruby software on my PC.

    Cheers,
    Rob
      tar xvf rbtree-0.4.4.gem && tar xvzf data.tar.gz
      gem unpack some_file.gem But if you do not want to download Ruby or Gem standalone, the gem file is just a .tar file, there are other .gz files inside it and you extract them
Re: Anyone with XS experience willing to create a high performance data type for Perl?
by perlfan (Vicar) on Nov 10, 2021 at 10:43 UTC
    Have you looked at PDL? It was created explicitly to provide high performance data types for Perl. It might not have the algos you're looking for, but I can't be so quick to say that it doesn't.

    Also, these bold requests would be easier on you if you approached it as something you're trying to do. It's easier to get help along the way than the approach you're taking here.

      PDL (saying this as a not-very-surprising fan) is best employed on rectangular data. I don't think tree structures are a very natural fit, although there are exceptions for some graph-theory problems!
Re: Anyone with XS experience willing to create a high performance data type for Perl?
by erix (Prior) on Nov 13, 2021 at 11:10 UTC

    A benchmark would be nice (indispensable, really)

    (Inserting 2M items into a database table takes under 1 second on an old desktop here - btree *shrug*)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11138654]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (4)
As of 2022-01-21 18:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    In 2022, my preferred method to securely store passwords is:












    Results (59 votes). Check out past polls.

    Notices?