note jcb <p>This was a fun exercise and here is a script that produces the expected results (in sorted order) and should be considerably faster than a full brute force search:</p> <code> #!/usr/bin/perl # -*- CPerl -*- use strict; use warnings; # Goals: # * Find a set of shared substrings for a set of input strings such that: # ** Each substring is at least 3 characters long. # ** Minimize total substring length, counting each substring as extra 2. use constant MIN_SUBSTRING_LEN => 3; use constant PER_SUBSTRING_OVERHEAD => 2; # sample input: my @list=("set abcde-efghi 12345", "set abcde-ijkl 12345", "clr abcde-efghi+123", "clr abcde-ijkl 12345"); # sample output: my @expected_substrings=("set","clr"," abcde-","efghi", "ijkl"," 12345","+123"); # cost of a solution set sub cost (@) { my \$cost = PER_SUBSTRING_OVERHEAD * scalar @_; \$cost += length shift while @_; return \$cost } # algorithm: # attempt to split common prefixes and suffixes into separate substrings; # terminate when this is no longer possible my @substrings = @list; my \$made_progress = 1; my \$last_output = ''; # find common prefixes # returns [ <prefix>, <tail>... ]... sub partition (@) { my @strings = sort @_; my @bins = (); my \$prefix = \$strings; for (my \$i = 0; \$i < @strings; \$i++) { next if \$prefix eq substr(\$strings[\$i], 0, length \$prefix); my \$new_prefix = \$prefix; \$new_prefix = substr \$new_prefix, 0, -1 while length \$new_prefix and \$new_prefix ne substr(\$strings[\$i], 0, length \$new_prefix); if (length \$new_prefix < MIN_SUBSTRING_LEN and @strings) { push @bins, [\$prefix, map {substr \$_, length \$prefix} splice @strings, 0, \$i]; \$i = 0; \$prefix = \$strings; } else { \$prefix = \$new_prefix; } } push @bins, [\$prefix, map {substr \$_, length \$prefix} splice @strings] if @strings; return @bins } while (\$made_progress) { # find prefixes my %new_substrings = (); my @bins = partition @substrings; \$new_substrings{\$_}++ for map {@\$_} @bins; @substrings = sort keys %new_substrings; # repeat for suffixes %new_substrings = (); @bins = partition map scalar reverse, @substrings; \$new_substrings{\$_}++ for map {@\$_} @bins; @substrings = grep length, sort map scalar reverse, keys %new_substrings; \$made_progress = (\$last_output ne join(':', @substrings)); \$last_output = join(':', @substrings); } print "results: (cost ",cost(@substrings),")\n"; print \$_, "\n" for @substrings; </code> <p>This script does not really try to produce a minimal-cost result set at all &mdash; it simply produces <i>a</i> solution quickly by repeatedly "peeling off" common prefixes and suffixes. The same <c>sub partition</c> is used for both, by simply reversing the strings to make suffixes into prefixes. It works by finding a common prefix, reducing that prefix while traversing the sorted input, and ending a group when the prefix is below the threshold length.</p> <p>(thanks to [LanX] for the reminder to use a hash for unique keys)</p> 11118281 11118281