http://qs321.pair.com?node_id=11138654

beautyfulman has asked for the wisdom of the Perl Monks concerning the following question:

  • Comment on Anyone with XS experience willing to create a high performance data type for Perl?

Replies are listed 'Best First'.
Re: Anyone with XS experience willing to create a high performance data type for Perl?
by talexb (Chancellor) on Nov 10, 2021 at 01:38 UTC
      For Perl kind of performance Tree:RB is ultra fast but look at Ruby's standard red and black tree, it is several orders of magnitude faster. The Ruby gem rbtree is just C code bind for dict.c, you can download it here https://rubygems.org/gems/rbtree Inside the gem file there is dict.c and dict.h, as well as ruby.c which is the wrap.

    If Tree::RB takes 33 seconds and the Ruby gem rbtree takes 2 seconds, then it's an exaggeration to say it's 'several orders of magnitude faster'; based on those numbers, it's one order of magnitude faster.

    And if the Ruby solution is just C code that you can download, how about just using Inline::C to wrap that code and use it as is.

    However, I have to ask my favourite question, which is "What problem are you trying to solve?" Why are you looking for something that sorts stuff? Is there any way you could put the data you have into a database, and have it deal with whatever operation you're trying to do?

    The beauty of using a database is that lots and lots of hard work has already been done on handling millions of rows efficiently; handling filtering, sorting and selection of those rows is what a database does. You can also have multiple tasks that add data, grab chunks of it, and delete stuff that's been processed.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      > then it's an exaggeration to say it's 'several orders of magnitude faster'; based on those numbers, it's one order of magnitude faster.

      I used to confuse orders of magnitude with times for a long time, until I realized the latter former referred to the exponent (usually applied to 10). Nonetheless, the type of hubris in this post begs for master of these kinds of topics. Perhaps OP is willing to contract someone to do this, which I think would be the appropriate tact. That said, the initial step of using Inline::C you suggested I think is the right one. For additional speed up, if the algorithm is parallelizable, is to introduce OpenMP, which is a lot more straightforward with Alien::OpenMP (and OpenMP::Environment for good measure).

      Updated thanks to note by jdporter.

        An order of magnitude refers to the difference in exponent (or that's what I learned). To take an extreme case, 10 and 99 (1E1 and 9.9E1) are the same order of magnitude, but 99 and 100 (9.9E1 and 1E2) are different orders of magnitude.

        And sure, a 10x difference is big, but all of these things are relative. Unless you're working at FAANG scale, this may not add up to much. So your job runs for 10 seconds instead of a second? Meh. OK. Ten minutes instead of a minute? OK. Ten hours instead of one hour? Hmm .. that does take a while. Maybe don't run it every hour.

        Alex / talexb / Toronto

        Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

          OK -- we have a tiny bit more information, but I still don't know why a database isn't a suitable solution for whatever problem you're trying to solve. Like I said, database developers have spent decades making sure their platform is stable and efficient. Perhaps I should ask this a different way -- is there a reason why a database wouldn't work for you? It's still not clear what the problem is.

          Alex / talexb / Toronto

          Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

    Re: Anyone with XS experience willing to create a high performance data type for Perl?
    by dave_the_m (Monsignor) on Nov 10, 2021 at 11:32 UTC
      What are your actual requirements? Do you need a balanced tree, or on-disk storage or what? For example could you just use a perl hash and track min/max key values during insertion? Or just from time to time process a sorted list of the hash keys? The following trivial code on my laptop takes 1.2s to create the hash and a further 1.2s to find the current smallest and largest keys.
      my %data; for (1..2_000_000) { $data{$_} = "some random text $_"; } my ($min, $max) = (sort { $a <=> $b } keys %data)[0,-1];

      Dave.

            So again, tell us what you're trying to achieve.

            It is exceedingly unlikely that the outcome of this thread is that someone will take it upon themselves to put a fast btree implementation on CPAN which meets all your requirements. What might reasonably be an outcome is that someone will advise you of a way of achieving your requirement in a reasonably speedy way, e.g. by using CPAN module X or perl technique Y or interesting algorithm Z. But only if we know what you want.

            Dave.

            > Ideally need a balanced tree like Red Black or AVL, the Perl hash takes 4 seconds

            Not surprisingly, the Perl hash, at 4 seconds, is a lot faster than the (pure Perl) Tree::RB, at 33 seconds, when running your (unpublished) benchmark.

            Is there any functional reason why you can't just use a hash?

            Since you mentioned AVL, have you tried AVLTree from CPAN? It uses a XS wrapper around Julienne Walker's AVL Tree C library, so should be a lot faster than Tree::RB.

            References

            Disclaimer: I have no experience using any of these, just did a quick google for you.

            It is exceedingly unlikely that the outcome of this thread is that someone will take it upon themselves to put a fast btree implementation on CPAN which meets all your requirements.

            Seriously! I mean you'd need someone with XS expertise who also happens to know about Red/Black trees, and who reads this forum. And unless you're paying them it would need to be a weird masochist who actually enjoys XS, or maybe someone who finds systems-level work cathartic after a long multi-week stretch of web-app tedium. Or maybe if they just really love the Red/Black algorithm and had their own Red/Black implementation in C that they've been porting from project to project since college. Maybe if they had finished a big refactor of the code a few months ago but hadn't gotten to use it in a project yet, that could be enough motivation.

            Congratulations, the stars aligned.

            Here's your module. Tree-RB-XS-0.00_01.tar.gz
            (not visible on metacpan.org yet)

            You get the distinct privilege of reporting the first bugs in a recently-refactored pointer-math-heavy C library wrapped with some creative new XS ideas.

            Update: I added a new KEY_TYPE_BSTR that copies the keys from Perl into plain buffers, and uses memcmp on them. It's a good percentage faster at the expense of incorrect unicode sorting. It's useful if your strings are ASCII.

            Update: I finished off most of the Tree::RB API, polished up the documentation, and gave it an official release. It also now has custom XS compare functions to choose from, such as CMP_NUMSPLIT, which can handle the fairly common scenario of a mixture of numbers and strings in the key.

      Re: Anyone with XS experience willing to create a high performance data type for Perl?
      by eyepopslikeamosquito (Archbishop) on Nov 10, 2021 at 02:52 UTC

        It seems to me you are leaping to conclusions. This sort of thing happens so often that I now have a node explaining my point of view:

              In all software companies I've worked at, human costs dwarf hardware costs. Even twenty years ago, I remember being told that it's cheaper to buy a new hard disk than to hold a meeting to discuss if we really need one or not. :)

        Re: Anyone with XS experience willing to create a high performance data type for Perl?
        by swl (Parson) on Nov 10, 2021 at 04:53 UTC

          It's worth looking at FFI::Platypus if you want to develop bindings and avoid XS.

        Re: Anyone with XS experience willing to create a high performance data type for Perl?
        by eyepopslikeamosquito (Archbishop) on Nov 10, 2021 at 06:53 UTC

          I benchmarked all data structures i could find on CPAN and Perl's built-in data types
          Can we see/run your benchmark code? How did the performance differ between RB trees (written in pure Perl?) vs Perl's built-in hashes?

        Re: Anyone with XS experience willing to create a high performance data type for Perl? (Updated)
        by bliako (Monsignor) on Nov 11, 2021 at 14:27 UTC

          I have created a test perl script which uses Inline::C and a heavily modified version of dict.c you have linked. Changes were necessary to make Inline::C understand the code. For example it seems to me that it dislikes function pointers inside struct. So I have set all function pointers to void*. One should typecast when necessary. It also seemed to me it had problems with enums within struct. I have replaced them with #define and integers.

          beautyfulman, it is now your call. Provide C code to test the functionality within C of dict.c. For example: create dictionary, add items, delete items, find items.

          EDIT: Important: in order for this code to work, I have made a change to Inline/Struct.pm. The original code I posted here needs that change in order to "work". Because I don't have test code I do not know whether my change will work or not, I know it compiles the code. That's all. So, I am now modifying the code below in order not to rely on Inline::Struct. The new code below will work with Inline::C (I have put comments where I did the changes).

          bw, bliako

            If there are shortcomings with Inline::Struct, please file an issue so they have a chance of being fixed :-)

              I will, thanks. The above was just a quick hack to get us going.

              Just for the record, the two problems I encountered (and could very much be red herrings) are that Inline::Struct failed when a C struct contained (EDIT: added next word:) typedef function pointers, I believe it did not understand the arcane C syntax (I love it though) : int (*funcpoi)(int, int); . It also did not like typedef enums within a struct. The above is just for the record, I will file a proper report at the channel you mentioned.

              bw, bliako

        Re: Anyone with XS experience willing to create a high performance data type for Perl?
        by bliako (Monsignor) on Nov 10, 2021 at 09:38 UTC
          #include <ruby.h>

          have you really thought that properly?

          (for the record: https://github.com/skade/rbtree/blob/master/dict.c )

          Unless you can provide a stand-alone rbtree.c, I don't find adding all ruby dependencies a good idea. When you eventually have such C code, use Inline::C to prototype it. The difficult part would be to provide the perl-vars into the C functions. There's documentation and a lot of examples though, so just a matter of time and effort. I am sure you will have something going for scalars at least in short time. And take it from there. See also: Re^3: XS, raspberry pi, and a hundred bucks on how to use Inline::C to bootstrap an XS project. Please post here again when you have something more concrete, and/or msg me.

          bw, bliako

            I think the OP should first show some reproducible benchmarks before jumping to action.

            There have been some new monks lately making bold demands without providing details.

            Cheers Rolf
            (addicted to the Perl Programming Language :)
            Wikisyntax for the Monastery

                  You are right. In the link I posted, dict.c depends on ruby.h but the link you posted does not, the dependency for ruby.h is now in rbtree.c. (I posted that link as I was lazy, to avoid finding out how to extract source code from gem files). Do you have test code to demonstrate use of dict operations implemented in dict.c?

            Re: Anyone with XS experience willing to create a high performance data type for Perl?
            by syphilis (Archbishop) on Nov 11, 2021 at 00:33 UTC
              Inside the gem file there is dict.c and dict.h, as well as ruby.c which is the wrap.

              What's the easiest way for me to obtain the contents of dict.c and dict.h (in human readable form) from the gem file ?
              I don't have any Ruby software on my PC.

              Cheers,
              Rob
                tar xvf rbtree-0.4.4.gem && tar xvzf data.tar.gz
              Re: Anyone with XS experience willing to create a high performance data type for Perl?
              by perlfan (Vicar) on Nov 10, 2021 at 10:43 UTC
                Have you looked at PDL? It was created explicitly to provide high performance data types for Perl. It might not have the algos you're looking for, but I can't be so quick to say that it doesn't.

                Also, these bold requests would be easier on you if you approached it as something you're trying to do. It's easier to get help along the way than the approach you're taking here.

                  PDL (saying this as a not-very-surprising fan) is best employed on rectangular data. I don't think tree structures are a very natural fit, although there are exceptions for some graph-theory problems!
              Re: Anyone with XS experience willing to create a high performance data type for Perl?
              by erix (Prior) on Nov 13, 2021 at 11:10 UTC

                A benchmark would be nice (indispensable, really)

                (Inserting 2M items into a database table takes under 1 second on an old desktop here - btree *shrug*)