
Re: Challenge: Fast Common Substrings

by lima1 (Curate)
on Apr 04, 2007 at 15:50 UTC


in reply to Challenge: Fast Common Substrings

Just for the sake of completeness: a fast and elegant algorithm for this is a tricky use of suffix trees. One concatenates the two strings of length n and m, separated and terminated by characters that occur in neither string, say abcdef%efgab$. It is possible to construct a suffix tree of this string in O(n+m) (Ukkonen's algorithm). To find the common substrings, one then has to search for nodes that have exactly two (or the number of strings) leaves belonging to the different words. The resulting suffix tree for "abcdef" and "efgab":
     |      |(3:cdef%efgab$)|leaf
     |(1:ab)|
     |      |(13:$)|leaf
tree:|
     |     |(3:cdef%efgab$)|leaf
     |(2:b)|
     |     |(13:$)|leaf
     |
     |(3:cdef%efgab$)|leaf
     |
     |(4:def%efgab$)|leaf
     |
     |      |(7:%efgab$)|leaf
     |(5:ef)|
     |      |(10:gab$)|leaf
     |
     |     |(7:%efgab$)|leaf
     |(6:f)|
     |     |(10:gab$)|leaf
     |
     |(7:%efgab$)|leaf
     |
     |(10:gab$)|leaf
     |
So "ab" has two leaves in different words (position <= 7 for leaf 1 and position > 7 for leaf 2). The same holds for "b", "ef" and "f".
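The idea above can be sketched in a few lines. This is only an illustration, not Ukkonen's algorithm: it builds a naive, uncompressed suffix trie (quadratic in the input length) and uses the position test to decide which word a suffix starts in. The function name `common_substrings`, the node layout and the default separator characters are assumptions made for this sketch; the separators must not occur in either input string.

```python
def common_substrings(a, b, sep="%", end="$"):
    """Return every substring shared by a and b, using a naive suffix trie.

    Assumes sep and end occur in neither a nor b.
    """
    text = a + sep + b + end
    split = len(a) + 1  # 1-based position of the separator (7 in the example)
    root = {"kids": {}, "words": set()}

    for start in range(len(text)):
        # Position test from the post: <= split means word 1, otherwise word 2.
        word = 1 if start + 1 <= split else 2
        node = root
        for ch in text[start:]:
            if ch == sep or ch == end:
                break  # nothing spanning a separator belongs to a single word
            node = node["kids"].setdefault(ch, {"kids": {}, "words": set()})
            node["words"].add(word)

    # Walk the trie; a node reached from suffixes of both words is common.
    # Children of a non-common node cannot be common, so we do not descend.
    found = set()
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, kid in node["kids"].items():
            if kid["words"] == {1, 2}:
                found.add(prefix + ch)
                stack.append((prefix + ch, kid))
    return found


print(sorted(common_substrings("abcdef", "efgab")))
# ['a', 'ab', 'b', 'e', 'ef', 'f']
```

Because the trie is uncompressed, it also reports "a" and "e", which in the compressed suffix tree sit in the middle of the edges labelled "ab" and "ef" rather than at nodes of their own.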

http://en.wikipedia.org/wiki/Longest_common_substring_problem

Update: Just found some Perl code with Google ... on PerlMonks ;) See Re: finding longest common substring

Replies are listed 'Best First'.
Re^2: Challenge: Fast Common Substrings
by blokhead (Monsignor) on Apr 04, 2007 at 16:02 UTC
    ++ Wow, thank you for introducing me to suffix trees. What an interesting concept, and how refreshing to see a linear-time algorithm for constructing such a creature. I see you've used the javascript applet at this page, which others may want to check out.

    However, I'd like to slightly revise the algorithm you outlined. Consider the following example:

    string = ababc%bc$

         |      |(3:abc%bc$)|leaf
         |(1:ab)|
         |      |(5:c%bc$)|leaf
    tree:|
         |     |(3:abc%bc$)|leaf
         |(2:b)|
         |     |     |(6:%bc$)|leaf
         |     |(5:c)|
         |     |     |(9:$)|leaf
         |
         |     |(6:%bc$)|leaf
         |(5:c)|
         |     |(9:$)|leaf
         |
         |(6:%bc$)|leaf
         |
         |(9:$)|leaf
    "ab" appears twice in the first string, and so it gives a node with two leaves. The actual condition you should check is whether a node has one leaf containing the % separator and another leaf without the % symbol.
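To make the counterexample concrete, here is a small sketch that skips the tree entirely and just lists the suffixes that would hang below a substring's node, then applies the separator test. The helper names `leaves_below` and `is_common` are made up for this illustration, and the scan is brute force rather than a real suffix-tree lookup.

```python
def leaves_below(text, sub):
    """Suffixes of text starting with sub, i.e. the leaves under sub's node."""
    return [text[i:] for i in range(len(text)) if text.startswith(sub, i)]


def is_common(text, sub, sep="%"):
    """True if sub has one leaf containing sep and another leaf without it."""
    leaves = leaves_below(text, sub)
    with_sep = any(sep in leaf for leaf in leaves)
    without_sep = any(sep not in leaf for leaf in leaves)
    return with_sep and without_sep


text = "ababc%bc$"
print(leaves_below(text, "ab"))  # ['ababc%bc$', 'abc%bc$']
print(is_common(text, "ab"))     # False
print(is_common(text, "bc"))     # True
```

Both leaves under "ab" still contain the % separator, so counting leaves alone would wrongly report it, while "bc" has one leaf on each side.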

    blokhead

      The page you link to mentions being able to build them in O(n) but then only really describes how to go from a suffix tree for string $x to one for string $x.$c (1==length$c) in O(length $x). Using that algorithm would require O(N*N) to build the suffix tree for a string of length N.

      So I'm not sure I believe the O(N) claim for building the whole suffix tree based on that page.

      - tye        

        The naive algorithm requires O(N*N). Ukkonen's algorithm needs only O(N). If you want to understand it (it is not trivial), I recommend Gusfield's book (Algorithms on Strings,...).
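For a feel of where the O(N*N) comes from, the sketch below inserts every suffix into an uncompressed trie one character at a time and counts the steps. It is a deliberately naive construction, not Ukkonen's algorithm, and the function name `naive_trie_steps` is an assumption made for this illustration.

```python
def naive_trie_steps(text):
    """Insert each suffix of text into a trie and count single-character steps."""
    root = {}
    steps = 0
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})  # follow or create the child
            steps += 1
    return steps


print(naive_trie_steps("abcdef%efgab$"))  # 91, i.e. 13 * 14 / 2
```

The count is always N*(N+1)/2 for a string of length N, which is the quadratic cost that the linear-time construction avoids.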
      Or even easier: check the positions of the substrings (<=7 and > 7 in my example).
