http://qs321.pair.com?node_id=417081

bowsie has asked for the wisdom of the Perl Monks concerning the following question:

Hello Holy Ones.
I hope you can help me out here. I've had a few brains scratching over this one, and while there are some solutions, none of them are very good.

I've got a file where each line represents one element, and the entries on the line are the matches to that element. The number of matches is arbitrary. I need to "network" those matches to find a non-redundant set of elements based on at least one element in common between lines. Here's an example:

Infile:
a b c d e
f b g
h i j k l
m f

I want to say that because lines 1 and 2 both contain "b", 1=2, and because lines 2 and 4 contain f, 2=4. Using the concept of if a=b and b=c then a=c, we can say that (1, 2, 4) is one complete set. Line 3 has no matches in common to any of the other lines, and so a second set is (3)

In the end I need a count of these sets.

I know to read the lines into a hash - the keys are the line numbers (for lack of something better) and the values are an array of the elements. I was thinking along the lines of needed to traverse the hash in an outer loop to grab an entry, then an inner loop traverse the hash once more looking for matches. When I find a match I can merge the values between the two hash entries and delete the inner-loop entry.

However, if I do this, the new value defined in the outer-loop hash entry will not be fully traversed, and I will run into problems there.

I have a solution that works in a 2-pass system where the first pass records pairs of line numbers with matches and the second pass networks those pairs. However, it's dead slow and while my little example is, well, little... I have large datasets to handle (think 200,000 lines with up to 10 matches per line!)

Any insights?

Thanks!

Replies are listed 'Best First'.
Re: Building Networks of Matches
by hv (Prior) on Dec 23, 2004 at 14:20 UTC

    I didn't fully understand your suggestion for how you might go about it, but below is the simplistic approach I'd use as a starting point. The key features of the scan are a) to build two data structures simultaneously: one to contain known sets of equivalences, and the second to contain the elements that those sets match; b) when new equivalences are found, the data structures for the equivalent sets are merged.

    If this isn't fast enough, my first thought to improve it would be to find some way of using bit vectors to represent the elements, so that matches can be checked with a bitwise-and of two strings. To do that, you'd need to find a way to translate elements into numbers that you can use as a bit offset.

    However, if there are lots of elements most of which appear only once, it may be better to do a prepass to get a list of repeated elements, and then consider only those repeats in the main loop.

    Hope this helps,

    Hugo

      This is VERY fast and very good - thank you! As for the prepass, I can do that easily with a sort unique in unix. :)

      You're a genius! This may be one of the best Xmas gifts I get this year!

      Thanks!

      Bowsie

Re: Building Networks of Matches
by simonm (Vicar) on Dec 23, 2004 at 17:36 UTC
    I've been looking at the Graph code lately and it occurred to me that this was easily expressed in that form, but I fear that this will not be fast enough for your purposes:

      I hate meetings.

      Slight variation that I was working on before having to attend a meeting. Use add_path rather than add_edges

      use strict; use Graph; use Graph::Undirected; my $ud = Graph::Undirected->new(); while(<DATA>) { my @items = split(' '); if(@items <= 1) { $ud->add_vertex(@items); } else { $ud->add_path(@items); } } foreach my $connectedSet ($ud->strongly_connected_components) { print "Connected Nodes: " . join(',', @$connectedSet) . "\n"; } __DATA__ a b c d e f b g h i j k l m f z

      Update: Modified code based on feedback. Have to handle the case where there is only one letter on a line. UNFORTUNATELY this cause it to lock on strongly_connected_components call.


      "Look, Shiny Things!" is not a better business strategy than compatibility and reuse.


      OSUnderdog
        This doesn't work. :(

        First, you cannot use strongly_connected_components in an undirected graph. However, simply substituting connected_components takes care of that, so that's not a big deal.

        However, it only works when there is more than one element on a line. If my data looked like this:

        __DATA__ a b c d e f b g h i j k l m f z
        it would not deal with the line with just the "z" on it - it would ignore that line and give me the same results as the data-set without the "z" instead of telling me there's a set of 1-ement containing z that is unique to the list.

        I will go read more about Graph and see if there's a solution to this problem. Thank you anyway
Re: Building Networks of Matches
by sasikumar (Monk) on Dec 23, 2004 at 13:59 UTC
    hi

    I am just rigting a mere logic i am not writing a code. please adjust with me as am ina hurry to move out of my office.
    I know u would be smart enough to make it into a working model

    just try this way a reverse hash

    Instead of the linenumber as has and elements in the array reverse this concept u will find a simple solution. Just build a hastable with ur elements as key and their line numbers as hash tabls. This would solve your problem.


    Please let me know your comments on it.

    Thanks
    Sasi Kumar
Re: Building Networks of Matches
by steves (Curate) on Dec 23, 2004 at 14:01 UTC

    My first thought is to think of this as a tree structure. Not that my first thoughts are always right ...

    Suppose you have a hash of root tree nodes. The key to the root node hash is one of your matches. Each node is itself a hash, containing several things: The line number, a (possibly undefined) reference to the next node in the tree, and anything else you may need. As you parse each element (each line), you take each match of that element and look it up by starting at the top of the tree: You do a lookup in the root node hash. If you find a hit, you follow the next node references down to the end, and add a new node. If you don't find a hit, you add a new element to the root hash. I believe this could also be optmized by keeping a last node hash. To get your relationships, you traverse each root node down when you're done.

    Update: Actually, I'm not sure this truly handles all the relationships that elegantly. At a minimum, each node would have to be an array of other (sub)nodes to get all the relationships. It seems like there should be a simpler way to do that.