Efficient algorithm needed to split Set::IntSpan objects

tim.bunce has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Efficient algorithm needed to split Set::IntSpan objects by blokhead (Monsignor) on Sep 23, 2009 at 14:00 UTC
I think the first step would be to split ranges which overlap other ranges into separate non-overlapping ranges: Why do you need to split them into non-overlapping segments before doing $hash{$item}++ ? Also, are you absolutely sure that iterating over the elements will really be a bottleneck? It seems likely that you won't be able to gain much efficiency since the obvious algorithm is so simple. Anyway, here is a sparser way to represent this problem: You can represent the set of ranges by just keeping track of places where the # of intervals changes, so that `00112233222000` [download] becomes `(3,1), (5,2), (7,3), (9,2), (12,0)` [download] In other words, if (i,j), (m,n) are adjacent in this list, then there are j ranges that cover element i to element m-1. This list is sparse, and its size only depends on the number of ranges, not the number of their elements. To query this list on a number (to see how many ranges cover a point x), you can do a binary search to find the largest number < x in the list. That entry in the list will tell you how many ranges cover x. To construct the list, you can do $delta{$start}++, $delta{$end}--, for every ($start,$end) interval (I chose a hash because it can stay sparse if the intervals are large). Then you can iterate through the sorted keys of %delta and make a running total. `my @intervals = ([3,11], [5,8], [7,11]); for (@intervals) { my ($start,$end) = @$_; $delta{$start}++; $delta{$end+1}--; } my $total = 0; my @points; for (sort { $a <=> $b } keys %delta) { next if $delta{$_} == 0; ## update: added this line $total += $delta{$_}; push @points, [$_, $total]; }` [download] Again, this is much more efficient in the theoretical sense (to generate the data structure takes O(n log n), where n is the # of intervals, compared to O(nt) where t is the average size of an interval), but maybe not much of a gain for you depending on the actual sizes of things involved (and depending on what kind of queries you want to make to the data structure). Querying the data structure is a tradeoff, it is now O(log n) instead of constant had you gone the route of iterating through all the elements of the intervals. blokhead	[reply] [d/l] [select]
Re: Efficient algorithm needed to split Set::IntSpan objects (merge sort) by tye (Sage) on Sep 23, 2009 at 14:46 UTC
Construct the sorted list of starting points and the sorted list of ending points. Do a "merge sort" on the two resulting sets but increment a count when selecting a starting number and decrement it when selecting an ending number. Track the last number selected and the count. If the new number doesn't match the previous number, then store the previous number and count. When you are done you'll have a list of starting numbers and counts. If you end up with consecutive identical counts, then you can drop all but the first of that run. Now you've got the starting number of each span and the number of repeats for that span (and a terminating end number with a count of 0). Update: The "ending number" needs to be the first number not in the span. So an int span is defined by "$i where $start <= $i < $end". - tye	[reply]
Re: Efficient algorithm needed to split Set::IntSpan objects by BrowserUk (Patriarch) on Sep 23, 2009 at 19:12 UTC
As the old joke goes: if I were setting out to get there, I wouldn't be starting from here! Presumably you do not magically start out with these thousands of Set::IntSpan objects. You must be building them up from the log file(s)? If so, it would be far more efficient to build up your required final representation, at the same time, (or in place of), as you are iterating the log and generating those objects. The logic is far simpler and falls out naturally from the ordered nature of the log: `my @timeSlots; my $concurrent = 0; my $lastTimeSlot = 0; while( <$log> ) { my( $time, $type ) = m[(...).+(Start\|End)]; ++$concurrent if $type eq 'Start'; --$concurrent if $type eq 'End'; my $timeSlot = timeToTimeslot( $time ); push @timeSlots, ( $concurrent ) x ( $timeSlot - $lastTimeSlot ); $lastTimeSlot = $timeSlot; ## Do stuff to build your intspan objects ... }` [download] By the time you've iterated the logs and generated the IntSpan objects, you already have your array of concurrency counts per timeslot for free. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. RIP PCW It is as I've been saying!(Audio until 20090817)	[reply] [d/l]
Re: Efficient algorithm needed to split Set::IntSpan objects by jwkrahn (Abbot) on Sep 23, 2009 at 13:55 UTC
`$ perl -le' my $x = "--XXXXXXXXX---"; my $y = "----XXXX------"; my $z = "------XXXXX---"; tr/-X/01/ for $x, $y, $z; print for $x, $y, $z; my $m = 0 x length $x; $m =~ s/(.)/ $1 + substr $_, $-[0], 1 /eg for $x, $y, $z; print for "", $m; ' 00111111111000 00001111000000 00000011111000 00112233222000` [download]	[reply] [d/l]
Re: Efficient algorithm needed to split Set::IntSpan objects by vitoco (Hermit) on Sep 23, 2009 at 14:46 UTC
c1 is overlapping b and a2 ranges. How many keys do you expect the hash to have? This approach could be difficult to handle... If you want to add many spans, you will end with many rows with only one X for each span. Also, if you want to add more than 9 spans with an X in the same position, the result string won't be enough.	[reply]
Re^2: Efficient algorithm needed to split Set::IntSpan objects by ikegami (Patriarch) on Sep 23, 2009 at 15:21 UTC
Also, if you want to add more than 9 spans with an X in the same position, the result string won't be enough. Easily fixed by using `chr($x)` instead of `$x`. That will allow 2^32 overlaps. Maybe even 2^64 on 64-bit systems. You might have to turn off some unicode warning.	[reply] [d/l] [select]
Re: Efficient algorithm needed to split Set::IntSpan objects by jethro (Monsignor) on Sep 23, 2009 at 15:03 UTC
Instead of splitting you could do a sort of accumulating using the set operations of Set::IntSpan: First create a result-set 0 which would have (in the example) a span of 1,14. Adding set a would create a result-set 1 with span 3-11 and result-set 0 with 1-2,12-14. This is easily done with set operations available in Set::IntSpan. For adding b you would create the union of result-set 0 and b, which is empty. Then you would create the union of result-set 1 and b, which is 5-8, subtract that from b and result-set 1, and add the range to result-set 2. `Before: XXXXXXXXXXXXXX = result-set 0 -------------- = result-set 1 -------------- = result-set 2 After adding a: XX---------XXX = result-set 0 --XXXXXXXXX--- = result-set 1 -------------- = result-set 2 After adding b: XX---------XXX = result-set 0 --XXX---XXX--- = result-set 1 -----XXX------ = result-set 2` [download] Whenever you add another set, you look for the union of your set and all the already created result-sets: If there is a range common to the set and result-set X, that range will be subtracted from both sets and added to the result-set X+1, until your set is empty Naturally this means if you have a value where all sets overlap, this will create lots of result-sets and make the algorithm slow again. If not it should be reasonable fast if set operations in Set::IntSpan are efficient as well	[reply] [d/l]
Re: Efficient algorithm needed to split Set::IntSpan objects by NiJo (Friar) on Sep 23, 2009 at 17:59 UTC
When iterating over all elements is not an option, you could use a sparse array (i. e. a hash). The hash stores the 'differential' of the spans. Your first line would be stored as: `3 => 1 11 => -1` [download] Of course your implementation has to loop once with ++ and -- over the spans. After sorting by keys the output, i. e. 're-integration', becomes trivial(++ and -- of a counter). A table of x and y values (only when there are changes) might be a better output of your box plot.	[reply] [d/l]


laziness, impatience, and hubris
	PerlMonks