Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Challenge: Simple algorithm for continuing series of integers

by moritz (Cardinal)
on Oct 19, 2008 at 19:26 UTC ( [id://718080]=perlquestion: print w/replies, xml ) Need Help??

moritz has asked for the wisdom of the Perl Monks concerning the following question:

Background

Perl 6 introduces the series operator, infix:<...>. Its normal behavior is to acccept a closure on the right side, and executes that closure to get next value, so you could write

my @even = 0, 2, 4 ... { $_ + 2}; my @powers = 1, 2, 4 ... { $_ * 2 }; my @fib = 1, 1, 2 ... { $^a + $^b};

Now there's a special, "magic" behavior if the right side isn't a closure but rather the Whatever star *:

my @ints = 1, 2, 3, 4 ... *; my @even = 0, 2, 4, 6 ... *; my @powers = 1, 2, 4, 8 ... *;

The spec says about that operator

If the right operand is * (Whatever) and the sequence is obviously arithmetic or geometric, the appropriate function is deduced:

1, 3, 5 ... * # odd numbers 1, 2, 4 ... * # powers of 2

Conjecture: other such patterns may be recognized in the future, depending on which unrealistic benchmarks we want to run faster. :)

The Challenge

It's not hard to detect arithmetic or geometric sequences and to continue them. My interest is more in an elegant, general algorithm for continuing such series. If you can provide such an algorithm, it might even make its way into the Perl 6 specs.

The criteria are

  • Principle of least surprise: If there's no obvious solution, bail out (or return undef) rather than making up a solution.
  • Simplicity: You should neither need a degree in computer science nor in mathematics to understand what's going on.
  • Generality: The more sequences it can detect and continue, the better. When a sequence is detected, it should be able to return arbitrary many items of that sequence.
  • Speed: Before you spend too much time in that operator rather give up (that implies no support for the sequence of prime numbers, since you can't easily find the next prime number).
  • Integers only, for now. Don't care about floating points and rounding errors for now.

(Using a dump of the OEIS database doesn't meet these criteria).

My thoughts

My thoughts in spoiler tags below. Please try to find your solution prior to reading that, I don't want to bias you into a particular direction.

Here is my idea for detecting sequences, in schematic Perl 6:

sub series(@items, $recursion_level = 2) { return if $recursion_level < 0; return if @items < 2; # are all items the same? return @items[0] if [==] @items; # detect arithmetic sequences my @diffs = map { @items[$_+1] - @items[$_] }, 0 .. (@items - 2); my $d = series(@diffs, $recursion_level - 1); return @items[*-1] + $d if $d.defined; # detect geometric sequences my $r = try { my @ratios = map { @items[$_+1] / @items[$_] }, 0 .. (@items - + 2); series(@ratios, $recursion_level - 1); } return @items[*-1] * $r if $r.defined; # give up return; }

Or translated to Perl 5 (which also re-uses some variables):

use List::MoreUtils qw(all); sub series { _series(2, @_); } sub _series { my $recursion_level = shift; return if $recursion_level < 0; return if @_ < 2; my $first = $_[0]; if (all { $_ == $first } @_) { return $first; } my @a = map { $_[$_+1] - $_[$_] } 0 .. (@_ - 2); my $r = _series($recursion_level - 1, @a); return $_[-1] + $r if defined $r; # catch division by zero $r = eval { @a = map { $_[$_+1] / $_[$_] } 0 ... (@_ - 2); _series($recursion_level - 1, @a); }; return $_[-1] * $r if defined $r; return; }

It detects the case when all items are equal, and otherwise calculates the pair-wise difference and ratio, and recurses.

That will work for geometric and arithmetic sequences, but for example fails for the fibonacci numbers. It would need some kind of correlation between the difference and the original sequence to make that work.

It passes 16 of the 20 tests below.

I've also written a small test script, but that should more serve as a starting point than as an authorotative test.

#!/usr/bin/perl use strict; use warnings; use Test::More qw(no_plan); sub series { # your code here. } my @tests = ( [[1], undef], [[1, 1], 1], [[0, 0], 0], [[1, 2], undef], [[0, 1, 2], 3], [[1, 0, -1], -2], [[1, 2, 3], 4], [[1, 2, 4], 8], # powers [[2, 4, 8], 16], [[1, 3, 9], 27], [[1, -1, 1, -1], 1], # alternations [[-1, 1, -1, 1], -1], [[1, 0, 1, 0], 1], [[0, 1, 0, 1], 0], [[1, 1, 2, 3, 5], 8], # fibonacci [[0, 1, 1, 2, 3], 8], [[1, 2, 3, 5, 8], 13], [[1, 2, 6, 24, 120], 720], # factorials [[1, 0, 0, 1], undef], [[1, 2, 3, 1], undef], ); for my $t (@tests) { my $expected = defined $t->[1] ? $t->[1] : 'undef'; my $result = series(@{$t->[0]}); $result = 'undef' unless defined $result; is $result, $expected, "seq " . join(', ', @{$t->[0]}); }

Update: Thanks for all the replies. I have to mull it over in my head, there are certainly good ideas and valid concern in the replies.

Replies are listed 'Best First'.
Re: Challenge: Simple algorithm for continuing series of integers
by blokhead (Monsignor) on Oct 19, 2008 at 22:07 UTC
    BTW, this test case is wrong:
    [[0, 1, 1, 2, 3], 8],
    Here is my approach, which will at least identify the homogenous linear recurrences (basically everything here except the factorial example). The main work is done by candidate_recurrence. A degree-D linear recurrence is something like this:
    a(n) = c(1)a(n-1) + ... + c(D)a(n-D)
    where the c(i) values are constant coefficients. Any 2D-1 values uniquely define a degree-D linear recurrence, and that recurrence is what candidate_recurrence computes. It also takes another parameter which says whether to consider a recurrence with a constant additive term as well, like
    a(n) = c(1)a(n-1) + ... + c(D)a(n-D) + c(0)
    In either case, it's just some basic linear algebra, for which the code currently uses PDL to solve.

    So given this candidate_recurrence, the sequence-identification algorithm is pretty simple. Just start with D very low and keep increasing it until you reach one that works with the entire sequence (low D = low complexity, so this is a kind of Occam's razor simplest (linear) explanation for the sequence).

    This works for all of your examples except for the factorials (not a homogenous linear recurrence). For test cases (1) and (1,2), it gives reasonable answers right now (all-ones sequence, and powers of two respectively), where your test wants it to fail. And for test case (1,2,3,1), there are precision issues with PDL's linear algebra, which is unfortunate. Still, all we need is to take the inverse of an integer matrix. There are no inherent issues with such a computation in general, I just didn't want to do it by hand, and didn't realize that PDL had precision issues!

    As for extending this to work with more general sequences, it seems like it would be quite difficult to get something that works with too many more classes of sequences. Fortunately linear recurrences encompass many common use cases. When your "magic star" recognizes 1, 1, 2, 5, 14, 42, 132 as the first few Catalan numbers, and 2, 6, 20, 70, 252 as the first few central binomial coefficients, you'll be onto something special ;)

    Update: Replacing your main loop with the following

    for my $t (@tests) { my @list = @{$t->[0]}; print "@list : "; my $R = identify(@list); if ($R) { print "identified as ", $R->print, " : next = ", $R->next(@lis +t), $/; } else { print "??\n"; } }
    .. gives some more verbose output:
    1 : identified as a(n) = 1 : next = 1 1 1 : identified as a(n) = 1 : next = 1 0 0 : identified as a(n) = 0 : next = 0 1 2 : identified as a(n) = 2*a(n-1) : next = 4 0 1 2 : identified as a(n) = a(n-1) + 1 : next = 3 1 0 -1 : identified as a(n) = a(n-1) + -1 : next = -2 1 2 3 : identified as a(n) = a(n-1) + 1 : next = 4 1 2 4 : identified as a(n) = 2*a(n-1) : next = 8 2 4 8 : identified as a(n) = 2*a(n-1) : next = 16 1 3 9 : identified as a(n) = 3*a(n-1) : next = 27 1 -1 1 -1 : identified as a(n) = -a(n-1) : next = 1 -1 1 -1 1 : identified as a(n) = -a(n-1) : next = -1 1 0 1 0 : identified as a(n) = -a(n-1) + 1 : next = 1 0 1 0 1 : identified as a(n) = -a(n-1) + 1 : next = 0 1 1 2 3 5 : identified as a(n) = a(n-1) + a(n-2) : next = 8 0 1 1 2 3 : identified as a(n) = a(n-1) + a(n-2) : next = 5 1 2 3 5 8 : identified as a(n) = a(n-1) + a(n-2) : next = 13 1 2 6 24 120 : ?? 1 0 0 1 : ?? 1 2 3 1 : identified as a(n) = 5*a(n-1) + -7*a(n-2) : next = -16 1 3 5 : identified as a(n) = a(n-1) + 2 : next = 7 2 4 6 : identified as a(n) = a(n-1) + 2 : next = 8

    blokhead

      Your reply is pretty similar to the reply I had in mind and started to write and then quickly abandoned. Thanks. :)

      Some minor differences: I planned to strictly limit the complexity of generator that would be considered, nothing beyond $s[$n]= $a + $b*$s[$n-1] + $c*$s[$n-2];. I find that the next order of generator is too opaque for the reader of the code and so isn't something I'd have automated (the expense also climbs). So I had no intention of using PDL.

      But I also wanted to request that guessing be made much safer than proposed. I don't find "1,1" compellingly unique nor "1,3,5". I'd want at least "1,1,1" and "1,3,5,7" to be required.

      Actually, I'd take the last two items off the end of the presented start of the sequence and use that to compute the simplest matching sequence generator from some simple class(es) of generators. Then I'd only use it if the next two numbers matched the guessed-at sequence. If the numbers don't match, then I'd reject the code, not try to guess again.

      For something this magical, it is good to have a couple of "check digits" so that it stays DWIM not DoSomethingSurprising. Making a single typo in one number should very much not just happily pick a sequence that I didn't intend to describe. And people are very prone to expect other people to see things as they themselves currently see things and so will type "1,2,4,7" and fully expect that the next item is "obvious" (and then will come back a few weeks later and have no idea what they had in mind when they wrote that). For this type of magic, it is incumbent upon Perl to require the author to be more explicit than they naturally will think that they need to be.

      So, if you have three numbers in your example, you have to predict the sequence from the first number alone. The simplest prediction is a constant list, so that is the only type of list that can be generalized from just three items. So fewer than 3 items or a sequence of 3 items not all the same would always refuse to be generalized.

      If you have 4 items, then you must predict based on only the first two. The simplest prediction would be a linear progression. So "1,2,3,4 ... *" would work as would "1,3,5,7 ... *".

      If you have 5 items, then you can use the first three to predict. So we'd solve the following two simultaneous equations:

      $s[1]= $a + $b*$s[0] $s[2]= $a + $b*$s[1]

      Or just:

      $b= ($s[2]-$s[1]) / ($s[1]-$s[0]); $a= $s[1] - $b*$s[0];

      (If the first two items are the same, then set $b to 0 which gives us a constant sequence again and generalizing fails unless the third item is also the same.)

      So for 1,2,4,$x,$y ... * Perl guesses $b= 2 so $a= 0 and fails unless $x is 8 and $y is 16.

      When we get to 1,4,9,16,$x,$y ... *, the human thinks "$n**2" and Perl 6 computes $a= 2, $b= 2, $c= -1, that is $s[$n]= 2 + 2*$s[$n-1] - $s[$n-2]; since (9) == 2 + 2*(4) - (1) and (16) == 2 + 2*(9) - (4). Did Perl DWIM? I leave that as an exercise for the reader.

      - tye        

        I agree, the magic-star should generally only identify a sequence that is nicely overspecified. But in the interest of Huffman coding the language I think that if anything warrants an exception, it should be arithmetic sequences (including constant sequences), since these would be the most common use cases (including 1,2,3,4...). Many times I've wanted to iterate over the multiples of N. I think "1,1,..*" or "2,4,6,..*" is just fine, and overspecifying the sequences by one more seems overkill because the rule is so simple/common.

        blokhead

Re: Challenge: Simple algorithm for continuing series of integers
by BrowserUk (Patriarch) on Oct 19, 2008 at 20:43 UTC

    As a purely intellectual challenge, provided that you recognise that any solution will be severally restricted in the sequences it will recognise, and will inevitably mis-categorise some high 90s percentage of the possiblities, then it is vaguely interesting.

    As an expenditure of Perl6 development time, it is a complete waste.

    This is because it would take an immense amount of effort (and code) to do anything more than a 'piss poor job'. And even if that effort was expended and the code incorporated, and it was possible for the computer to intuit the sequence behind (say):

    my @sequence = 1, 18, 4, 13 .. *;

    The human being (that mythical maintanence programmer), will unlikely be able to do the same!

    And almost everyone is going to code that as:

    my @sequence = 1, 18, 4, 13 .. *; ## dartboard

    Which means it would be far simpler and far better to avoid the "magical" and just incorporate (or have a module that supplies), a few pre-defined sequences like:

    my @odds = ODDS; ## or LAZY_ODDS or LAZY_ODDS_FROM( 11 ) my @evens = EVENS; my @board = DARTBOARD;

    A simple module that supplies a bunch of these 'well named sequences' would be easy to code and infinitely extensible and keep the need for potentially buggy and always limited 'magical' code out of the core.

    Building, testing and verifying a module that contained/exported the entire OEIS database manually would probably take far less time, than trying to write code to divine 1% of them. And it probably would not take a great deal of effort to write code to query the OEIS database directly, and generate the module automatically.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I fully agree.

      Note that "magical guessing" the sequence makes it quite hard to improve the ... * operator.

      Suppose in perl 6.0, you have such a magical ... * operator, that "guesses" how a sequence continues. Now, for 6.1, someone comes up with an idea to "improve" the operator, allowing it to "better guess" certain classes of sequences. But how would you know such an improvement wouldn't break someone codes? The "improvement" is going to change the guess of certain start sequences (otherwise, there wouldn't be such an improvement). But someone else may be quite content with the 6.0 guess. So this isn't an operator that could easily "tune" itself over releases - unless you don't care about backwards compatability.

        It would be relatively easy to make a rule that new guessers are allowed to work only on the unguessable residue left over from previous versions. And of course, there's little reason to install guessers that would be opaque to most readers, and I expect that the writer of the program should certainly name the series or the generator function such that it is clear to the reader.

        So actually, I was serious in my proposal that we make it easy to pull well-named generator functions from a well-known database, and then the numbers on the front are just examples of the first few values in the series--the real meat is in the function name. But a name is a useful abstraction, compared to forcing people to write complicated closures that might be just as unreadable as * is. And the standard function derived in a standard fashion may well be more optimizable than the user-written equivalent, where various idiosyncratic usages may make optimization difficult.

        So that's why the spec currently limits * guessing to some very simple cases, and I have no problem with taking a very conservative approach to adding more guessers.

      Well, I like the challenge as is, but I also think that the magical whatever * has not much practical use - or I cannot imagine one yet but for the simplest sequences.

      Which means it would be far simpler and far better to avoid the "magical" and just...
      Agreed, there would be too much room for ambiguity. Although TIMTOWTDI is part of the philosophy (* vs. {...}), I guess we will finally end with YABP (BP=best practice):

      The Series Operator ...

      • Never use the magical whatever (*).
      • Define arithmetic and geometric series with an explicit closure.

      But it is late and my visionary powers are dwindling ... ;-)

Re: Challenge: Simple algorithm for continuing series of integers
by JavaFan (Canon) on Oct 19, 2008 at 19:57 UTC
    One problem with this is that for given start sequence, there are uncountable sequences for which that is a start sequence. And your criteria are quite subjective. I, for one, claim that
    sub series {0;}
    satisfies at least the four last criteria, and I can argue it satisfies the first one as well.

    But this is probably not what you want.

      Well, your reply is interesting, but not very helpful.

        Note that for any given series of numbers a1 .. ak there's a polynomial f(x) of degree less then or equal to k-1 such that f(1) == a1, f(2) == a2, etc. For obvious reasons, all the coefficients of f(x) are rational (otherwise, f(x) cannot be an integer for integer x).

        Given that there's at least one function that uses only the given operations, and a finite number of such operations, there's also one that uses the smallest number of operators. But it's not necessarily unique.

        I don't know whether this is helpful.

Re: Challenge: Simple algorithm for continuing series of integers
by dragonchild (Archbishop) on Oct 20, 2008 at 17:21 UTC
    The '*' idea is absolutely horrible. Nice in theory, but will suck in practice. If it's implemented, it should be in terms of well-specific sequences. No derivations possible. If the sequence isn't one of the 20-30 that are listed in the manual, then that should throw an error. My reasoning goes like this: How do you know what the next element of 1,1,2,3 is? What about 1,1,1,3? 1,2,4 also has dozens of options. There's just too many possibilities for mistakes.

    Remember - predictability is good. Guessing is bad.


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: Challenge: Simple algorithm for continuing series of integers
by kyle (Abbot) on Oct 20, 2008 at 19:09 UTC

    I more or less agree with most of the criticisms of this idea that I've read, but I felt like writing a solution anyway.

    Note that this doesn't pass all tests or even some of the ones that look easy. I have series() written to look at a list of sub references and ask each one for a "next in list" for the proposed list. If exactly one of them gets an answer, I return that answer, otherwise undef.

    If the list of guessers becomes large or expensive, it would be more efficient to ask each one and fail as soon as two return answers.

    It's easy to add other guessers, and it won't ever give an answer if the question is ambiguous. This may be odd. The series "0, 0" has an answer because it won't detect as geometric, but "1, 1" does not have an answer because it could be either one, even though the next in sequence is '1' both ways.

Re: Challenge: Simple algorithm for continuing series of integers
by AnomalousMonk (Archbishop) on Oct 20, 2008 at 21:59 UTC
    Interesting problem. As others have written, I may have doubts about the basic idea, but I, also, couldn't resist...

    Here's what I've been able to come up with. (Is this the proper place to post? Maybe my scratchpad...?)

    .pm file:

    .t file:

    Update: I should perhaps have noted that all tests in the .t file pass.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://718080]
Approved by GrandFather
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others lurking in the Monastery: (7)
As of 2024-03-28 19:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found