decomposing binary matrices (2)

hv has asked for the wisdom of the Perl Monks concerning the following question:

Update: I now have a working algorithm in Re^2: decomposing binary matrices (2) below; the rest is a SMOO.

After getting a variety of useful leads in the discussion at decomposing binary matrices, I've thought about this a bit more. I think I can simplify the problem a bit by splitting into two parts - first cleaning the matrix up to remove all the useless '1'-bits, and then (if that still seems useful) to split the matrix into discrete submatrices.

To recap and amplify the problem: I have a matrix of booleans in which rows represent variables and columns represent values, in which a value of TRUE (1) means "this variable may take this value", and FALSE (0) means "this variable cannot take this value". (Note: the rows/columns were transposed in my original post.) We have the additional constraint of uniqueness - the value of each variable is distinct from that of every other value.

The task is a) "cleanup" - to gain additional information from the uniqueness constraint, by identifying TRUE values for variable/value combinations that cannot form a part of any solution, and setting them to FALSE; b) "splitting" - to simplify the matrix (if possible) by splitting it into two or more disjoint submatrices.

For example:

1 2 3 4 5

A 1 1 0 0 0

B 1 1 0 0 0

C 1 1 1 1 1

D 1 1 1 1 1

(by cleanup) becomes

1 2 3 4 5

A 1 1 0 0 0

B 1 1 0 0 0

C 0 0 1 1 1

D 0 0 1 1 1

1 2 3 4 5

A 1 1 0 0 0

B 1 1 0 0 0

C 0 0 1 1 1

D 0 0 1 1 1

(by splitting) becomes

1 2

A 1 1

B 1 1

3 4 5

C 1 1 1

D 1 1 1

Both of these examples represent the same gleaned information: due to the uniqueness constraint, A and B must between them consume the values 1 and 2, so C and D are constrained between them to the values 3, 4 and 5.

In my original formulation I reversed the order of these tasks - I hoped to find an easy way to split the matrix, and have the additional information to be gained from the uniqueness constraint "fall out" as a result. I suspect now that splitting is a hard problem, and may be made easier by doing the cleanup first.

Here is some pseudocode for a recursive procedure to solve the cleanup problem:

define clean [ matrix m ]
  matrix m2 = matrix_of_zeroes(m.dimensions)
  for point p (m.allpoints)
    if m.value(p) and not m2.value(p)
      if not try_solution(m, m2, [ p ])
        m.value(p) = 0
  # by now m and m2 are the same

define try_solution [ matrix m, matrix m2, array of points plist ]
  int row = first_row_not_already_tried(plist)
  if not defined row
    # all rows tried, so this is a solution
    for point p (@plist)
      m2.value(p) = 1
    return true
  for int col (grep column_not_in_plist(plist, $_), m.allcols)
    point p = point_at(row, col)

    if m.value(p)
      if try_solution(m, m2, [ @plist, p ])
        return true
  return false
[download]

The idea here is, for each TRUE value in the original matrix, see if there is any consistent solution that assigns a unique value to every variable; if not, mark it as FALSE instead. Short-circuit some of the work by recording every assignment used whenever a consistent solution is found: none of those needs to be retried by the top level.

I also plan an additional optimisation (not shown) such that if in try_solution() we pick a row that has no TRUE values in the available columns, we signal that fact to the caller - assuming that the caller is also try_solution(), it can then switch to trying that row instead (or propagate further up the call chain).

The biggest savings seem to come from getting a value set in m2 before a point is ever tried in clean(), and there are a couple of thigs I can do to try and maximise the number of times that happens - maybe by selecting a random row in try_solution(), or by preferring not-yet-seen columns, or by seeding it with a batch of random solutions.

There may also be benefits to be gained from an initial transformation of the matrix, to order both the rows and the columns in increasing order of the number of TRUE values.

As to how to implement this, I'm not sure. I suspect PDL might be a good route to go, but I've never used it, and am not sure where to start with it; I also can't see that is has support for a datatype any smaller than "byte", so it may not be appropriate to use in this case. Failing that, I think I might be able to achieve reasonable efficiency encoding the matrix as a list of bit vectors, and making the 'plist' a structure that caches the list of rows and the list of columns used so far.

Even so, the combinatorial explosion may make the whole thing senseless - my initial target to try this out on (after testing) is a 26 x 300 matrix, with a potential number of solutions around comb(300, 26) =~= 2E37. I also hope therefore to identify characteristics of the starting matrix that mean no gain is possible, so that I can skip the effort entirely in the worst cases.

If I can get this working at all usefully, I'll look at the splitting problem again.

Any suggestions for implementation or a better algorithm would be gratefully received.

Hugo

Comment on decomposing binary matrices (2) Download Code

Replies are listed 'Best First'.
Re: decomposing binary matrices (2) by blokhead (Monsignor) on Feb 27, 2007 at 19:08 UTC
If you look at this problem in terms of bipartite matchings (see Re: decomposing binary matrices), then your `clean` routine is simply removing edges that are never used in any maximum matching (a graph theorist would say: a matching that saturates your set of variables). An efficient way to check this is the following: To check if an edge (u,v) is used in some matching, remove (nodes) u & v from the graph and see if the remaining graph has a maximum matching. I think this is essentially what you are doing, but I'm not sure. Now, checking a graph for a maximum matching can be done efficiently (see for example http://www.maths.lse.ac.uk/Courses/MA314/matching.pdf). Well, at least much more efficiently than how you have been doing it by trying all possible solutions. Overall, this process for `clean` would be something like cubic time in the size of your matrix. I'm away from a perl interpreter at the moment, but can maybe offer something a little more concrete later. Unfortunately, Graph doesn't have any pre-packaged thingy for finding matchings. blokhead	[reply]
Re^2: decomposing binary matrices (2) by hv (Prior) on Feb 27, 2007 at 21:01 UTC
Ah yes, I think I understand - I'd looked at the perfect match algorithm before, but discarded it then because at the time I was trying to solve the splitting problem for which I needed specifically a cycle, and it looked like extending the algorithm to require a cycle was going to degenerate it into something no better than the brute force. If I'm no longer looking for a cycle, then I can probably use that algorithm. I'll give it a go, and come back. Update: I do understand now. I had read your approach as "to check the edge uv, remove the edge and check ...", but eventually worked out that couldn't be right, that I had to remove the two vertices instead. What's more, thinking about it in the graph form it is clear to me that after cleanup the split also becomes trivial - the graph simply falls apart into disconnected subgraphs. Thanks, Hugo	[reply]
Re^2: decomposing binary matrices (2) by hv (Prior) on Feb 28, 2007 at 09:22 UTC
Excellent, it works. :) The code below demonstrates and tests the cleanup() method and its dependencies - the rest is just a minimal hack to wrap a test harness around it. I also hacked in `%rev` at the last minute when I realised `%$assign` was the wrong way round for my needs, I still need to refactor that. Read more... (3 kB) Thanks very much for your help, Hugo	[reply] [d/l] [select]
Re: decomposing binary matrices (2) by andye (Curate) on Feb 28, 2007 at 11:04 UTC
Hi Hugo, Re PDL: it is possible to add types, using PDL::Types. Each type needs to be something the C compiler understands, though. It does occur to me that perhaps you could stick with the types that are already there, and pack your flags into a byte (or larger) datatype, then use bitwise operators to work on them. Take a look at bandover() and borover() in PDL::Ufunc. In general: PDL is great and well worth the time spent learning. The chief disadvantage is that occasionally the syntax of PDL threading, and the amount of power that you can put into one statement, can make your head explode. Ensure a plentiful supply of calming tea and biscuits while learning. Also, once you start doing threaded ops on large matrices, it's easy to run out of memory - that's not PDL's fault though. Best, andye	[reply]
Re^2: decomposing binary matrices (2) by hv (Prior) on Feb 28, 2007 at 13:17 UTC
Unfortunately, on this occasion exactly as on a half-dozen previous occasions, by the time I'd determined whether the task was remotely possible with PDL and had some idea of what I'd need to learn to get there, I'd already solved the problem some other way. I think that's mainly an issue of domain - the type of problems I tend to deal with are rarely a close fit to the sort of problem PDL is primarily designed to solve. That leaves me never quite sure whether the benefit (to me) would justify the effort of climbing its learning curve, though I recognise that within its domain it is clearly a useful and powerful tool. Hugo	[reply]
Re^3: decomposing binary matrices (2) by andye (Curate) on Feb 28, 2007 at 13:26 UTC
Yeah, I know what you mean. I found the book() "PDL: Scientific Programming in Perl" a useful, and fairly gentle, introduction - you can always read it in the bath sometime when you don't have a particular use for PDL - then when you do, you can fish the knowledge out of the back of your brain. That's my technique, anyway. :) Best, andye An online book - not a printed 'book' book. I don't have the URL handy but I'm sure Google will find it for you.	[reply]
Re: decomposing binary matrices (2) by Moron (Curate) on Feb 28, 2007 at 11:06 UTC
All data is topologically a matrix of bits, so it might be an idea to pack 'b', your matrices into simple integer arrays and map ^= onto the resulting integers to do the cleaning. The splitting could then be reduced to a question of mapping or grepping == onto the integers and you could leave the unpack to the very end of the algorithm to return it back in matrix form. -M Free your mind	[reply]


Perl: the Markov chain saw
	PerlMonks