Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Efficient algorithm needed to split Set::IntSpan objects

by tim.bunce (Scribe)
on Sep 23, 2009 at 13:22 UTC ( #796962=perlquestion: print w/replies, xml ) Need Help??

tim.bunce has asked for the wisdom of the Perl Monks concerning the following question:

I humbly seek the wisdom of the Monks.

Given a list of Set::IntSpan objects I'd like to find the number of overlapping ranges for any given value. For example, given the three ranges represented here:

  --XXXXXXXXX---  a = 3-11
  ----XXXX------  b = 5-8
  ------XXXXX---  c = 7-11

I'd like to end up with these values:

  00112233222000

For my use-case I know that each span is just one contiguous range - there are no holes in the spans. Also, the spans may be large so iterating over the elements of a span isn't an option.

I think the first step would be to split ranges which overlap other ranges into separate non-overlapping ranges:

  --XX---------- a1 = 3-4
  ----XXXX------ a2 = 5-8
  --------XXX--- a3 = 9-11
  ----XXXX------ b  = 5-8
  ------XX------ c1 = 7-8
  --------XXX--- c2 = 9-11

then a simple $hash{ $stringified_span }++ would give me the counts.

Any suggestions for how to do this efficiently? Or a different approach? (I've thousands of spans representing events with durations from a log)

Replies are listed 'Best First'.
Re: Efficient algorithm needed to split Set::IntSpan objects
by blokhead (Monsignor) on Sep 23, 2009 at 14:00 UTC
    I think the first step would be to split ranges which overlap other ranges into separate non-overlapping ranges:
    Why do you need to split them into non-overlapping segments before doing $hash{$item}++ ?

    Also, are you absolutely sure that iterating over the elements will really be a bottleneck? It seems likely that you won't be able to gain much efficiency since the obvious algorithm is so simple. Anyway, here is a sparser way to represent this problem:

    You can represent the set of ranges by just keeping track of places where the # of intervals changes, so that

    00112233222000
    becomes
    (3,1), (5,2), (7,3), (9,2), (12,0)
    In other words, if (i,j), (m,n) are adjacent in this list, then there are j ranges that cover element i to element m-1. This list is sparse, and its size only depends on the number of ranges, not the number of their elements.

    To query this list on a number (to see how many ranges cover a point x), you can do a binary search to find the largest number < x in the list. That entry in the list will tell you how many ranges cover x.

    To construct the list, you can do $delta{$start}++, $delta{$end}--, for every ($start,$end) interval (I chose a hash because it can stay sparse if the intervals are large). Then you can iterate through the sorted keys of %delta and make a running total.

    my @intervals = ([3,11], [5,8], [7,11]); for (@intervals) { my ($start,$end) = @$_; $delta{$start}++; $delta{$end+1}--; } my $total = 0; my @points; for (sort { $a <=> $b } keys %delta) { next if $delta{$_} == 0; ## update: added this line $total += $delta{$_}; push @points, [$_, $total]; }

    Again, this is much more efficient in the theoretical sense (to generate the data structure takes O(n log n), where n is the # of intervals, compared to O(nt) where t is the average size of an interval), but maybe not much of a gain for you depending on the actual sizes of things involved (and depending on what kind of queries you want to make to the data structure). Querying the data structure is a tradeoff, it is now O(log n) instead of constant had you gone the route of iterating through all the elements of the intervals.

    blokhead

Re: Efficient algorithm needed to split Set::IntSpan objects (merge sort)
by tye (Sage) on Sep 23, 2009 at 14:46 UTC

    Construct the sorted list of starting points and the sorted list of ending points. Do a "merge sort" on the two resulting sets but increment a count when selecting a starting number and decrement it when selecting an ending number. Track the last number selected and the count. If the new number doesn't match the previous number, then store the previous number and count. When you are done you'll have a list of starting numbers and counts. If you end up with consecutive identical counts, then you can drop all but the first of that run. Now you've got the starting number of each span and the number of repeats for that span (and a terminating end number with a count of 0).

    Update: The "ending number" needs to be the first number not in the span. So an int span is defined by "$i where $start <= $i < $end".

    - tye        

Re: Efficient algorithm needed to split Set::IntSpan objects
by BrowserUk (Pope) on Sep 23, 2009 at 19:12 UTC

    As the old joke goes: if I were setting out to get there, I wouldn't be starting from here!

    Presumably you do not magically start out with these thousands of Set::IntSpan objects. You must be building them up from the log file(s)?

    If so, it would be far more efficient to build up your required final representation, at the same time, (or in place of), as you are iterating the log and generating those objects. The logic is far simpler and falls out naturally from the ordered nature of the log:

    my @timeSlots; my $concurrent = 0; my $lastTimeSlot = 0; while( <$log> ) { my( $time, $type ) = m[(...).+(Start|End)]; ++$concurrent if $type eq 'Start'; --$concurrent if $type eq 'End'; my $timeSlot = timeToTimeslot( $time ); push @timeSlots, ( $concurrent ) x ( $timeSlot - $lastTimeSlot ); $lastTimeSlot = $timeSlot; ## Do stuff to build your intspan objects ... }

    By the time you've iterated the logs and generated the IntSpan objects, you already have your array of concurrency counts per timeslot for free.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Efficient algorithm needed to split Set::IntSpan objects
by jwkrahn (Monsignor) on Sep 23, 2009 at 13:55 UTC
    $ perl -le' my $x = "--XXXXXXXXX---"; my $y = "----XXXX------"; my $z = "------XXXXX---"; tr/-X/01/ for $x, $y, $z; print for $x, $y, $z; my $m = 0 x length $x; $m =~ s/(.)/ $1 + substr $_, $-[0], 1 /eg for $x, $y, $z; print for "", $m; ' 00111111111000 00001111000000 00000011111000 00112233222000
Re: Efficient algorithm needed to split Set::IntSpan objects
by vitoco (Friar) on Sep 23, 2009 at 14:46 UTC

    c1 is overlapping b and a2 ranges.

    How many keys do you expect the hash to have?

    This approach could be difficult to handle... If you want to add many spans, you will end with many rows with only one X for each span.

    Also, if you want to add more than 9 spans with an X in the same position, the result string won't be enough.

      Also, if you want to add more than 9 spans with an X in the same position, the result string won't be enough.

      Easily fixed by using chr($x) instead of $x. That will allow 2^32 overlaps. Maybe even 2^64 on 64-bit systems. You might have to turn off some unicode warning.

Re: Efficient algorithm needed to split Set::IntSpan objects
by jethro (Monsignor) on Sep 23, 2009 at 15:03 UTC

    Instead of splitting you could do a sort of accumulating using the set operations of Set::IntSpan:
    First create a result-set 0 which would have (in the example) a span of 1,14. Adding set a would create a result-set 1 with span 3-11 and result-set 0 with 1-2,12-14. This is easily done with set operations available in Set::IntSpan.

    For adding b you would create the union of result-set 0 and b, which is empty. Then you would create the union of result-set 1 and b, which is 5-8, subtract that from b and result-set 1, and add the range to result-set 2.

    Before: XXXXXXXXXXXXXX = result-set 0 -------------- = result-set 1 -------------- = result-set 2 After adding a: XX---------XXX = result-set 0 --XXXXXXXXX--- = result-set 1 -------------- = result-set 2 After adding b: XX---------XXX = result-set 0 --XXX---XXX--- = result-set 1 -----XXX------ = result-set 2

    Whenever you add another set, you look for the union of your set and all the already created result-sets: If there is a range common to the set and result-set X, that range will be subtracted from both sets and added to the result-set X+1, until your set is empty

    Naturally this means if you have a value where all sets overlap, this will create lots of result-sets and make the algorithm slow again. If not it should be reasonable fast if set operations in Set::IntSpan are efficient as well

Re: Efficient algorithm needed to split Set::IntSpan objects
by NiJo (Friar) on Sep 23, 2009 at 17:59 UTC
    When iterating over all elements is not an option, you could use a sparse array (i. e. a hash). The hash stores the 'differential' of the spans. Your first line would be stored as:

    3 => 1 11 => -1
    Of course your implementation has to loop once with ++ and -- over the spans. After sorting by keys the output, i. e. 're-integration', becomes trivial(++ and -- of a counter). A table of x and y values (only when there are changes) might be a better output of your box plot.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://796962]
Approved by marto
Front-paged by almut
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (2)
As of 2020-10-26 07:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My favourite web site is:












    Results (250 votes). Check out past polls.

    Notices?