Re: Best method to eliminate substrings from array

How about something like this (this does depend on them being sorted shortest to longest):

my @data = sort { length($a) <=> length($b) } qw(
    2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803
    3419308|3514531|3525716|3557019|3586192|3635776|3783741
    3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854
    3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721
    3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934
    4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377
    4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788
    6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803
    6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934
    6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377
    6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152
);
my %uniq;

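foreach my $elem (@data) {
    my @parts = split /\|/, $elem;
    # Remove any stored entry that is a leading prefix of the current one.
    foreach my $p (0 .. $#parts) {
        my $e = join '|', @parts[0..$p];
        delete $uniq{$e} if exists $uniq{$e};
    }
    $uniq{$elem} = 1;
}

# Hash keys are returned in no particular order.
print "$_\n" for keys %uniq;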

Re^2: Best method to eliminate substrings from array by AnomalousMonk (Archbishop) on Jun 28, 2019 at 18:28 UTC
Some comments:

In your code here, an input part number group item is treated as a subset of another group only if it is anchored at the left end of the larger group. E.g., the items `7K3377|3H5788 8W1152 4P0489|2757803` added to the list of test input data will not be excluded from output, but, of course, `2N0472|6N8595 2N0472` will be. In the OPed code, the `if`-block `if ($strChain ne $_ && index($_, $strChain) >= 0) { $found = true; last; }` implies that a part number group is a subset if it is found anywhere (per the `>=` comparison) in the larger group (and is not identical to the larger group).

Additionally, the OPed code implies that duplicated items in the input appear unchanged in the output (if they are not part of any larger group), e.g., `123 ... 123` in the input would appear as `123 ... 123` in the output. In your code, these items would be made unique.

Also, the OPed code would produce output in the same order as the input items (less subsets), although this implied requirement seems less imperative than the others. Because it's taken directly from a hash, your code will produce output in random order.

Give a man a fish: `<%-{-{-{-<`
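To make the three points above concrete, here is a minimal sketch of the behaviour they describe: a group is dropped if it occurs anywhere inside a different group, duplicates survive, and input order is kept. The helper name `drop_contained` and the sample data are hypothetical, not taken from the OP. Wrapping both strings in `|` before calling `index` is an added assumption so that only whole part numbers match; the OPed snippet compares the raw strings.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): keep every item that is not a
# contiguous run of whole part numbers inside some other, different item.
sub drop_contained {
    my @items = @_;
    my @kept;
    for my $small (@items) {
        my $found = 0;
        for my $big (@items) {
            next if $big eq $small;    # identical items never exclude each other
            # Assumption: delimiting both sides with '|' restricts index()
            # to matches on whole part numbers, at any position.
            if (index("|$big|", "|$small|") >= 0) {
                $found = 1;
                last;
            }
        }
        push @kept, $small unless $found;    # input order is preserved
    }
    return @kept;
}

my @result = drop_contained(qw(
    2N0472|6N8595|9L1366
    123
    6N8595|9L1366
    2N0472
    123
    9L1366|2N0472
));
print "$_\n" for @result;
```

With this sample, `6N8595|9L1366` and `2N0472` are dropped because both occur inside the first group, while both copies of `123` and the reordered `9L1366|2N0472` are kept. The nested loop is quadratic in the number of items, which may matter for large inputs.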
Re^3: Best method to eliminate substrings from array by Paladin (Vicar) on Jun 28, 2019 at 21:39 UTC
Here OP says that the list of part numbers is to be treated as sets/subsets, so while the original code matches substrings, OP says later that this is incorrect. My code treats the long strings as ordered sets, which seems to be what the OP wanted. If the OP really wants to treat the list of parts as a non-ordered set, it's easy enough to add a `sort` to the `join` line. OP also says here that they are sorting the original list anyway, so the input order seems to be irrelevant.

I'm not quite sure what you mean by the duplicated items part. Essentially, my code breaks each line (set) into individual part numbers (elements), then checks, for each prefix of elements, whether that prefix already exists in the final result; if it does, it is removed, as the current line supersedes it. So if the current line was "A|B|A|B|C", it first checks if "A" is in the result and, if so, removes it. Then it checks "A|B", then "A|B|A", etc., until finally adding the entire line "A|B|A|B|C" to the final result. If the line "A|B|A|B|C|N" is found later in the file, "A|B|A|B|C" would get removed at that point.
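The walk-through above can be run as a small self-contained sketch. The wrapper name `drop_prefixes` and the extra `B|C` line are hypothetical, added only for illustration; the loop stops one element short of the full line, since the line itself is added right afterwards anyway.

```perl
use strict;
use warnings;

# Hypothetical wrapper (not from the thread) around the prefix-removal loop.
sub drop_prefixes {
    # Shortest first, so a prefix is always stored before the line that extends it.
    my @lines = sort { length($a) <=> length($b) } @_;
    my %uniq;
    for my $line (@lines) {
        my @parts = split /\|/, $line;
        # Remove any stored line equal to a proper leading prefix of this one.
        for my $p (0 .. $#parts - 1) {
            delete $uniq{ join '|', @parts[0 .. $p] };
        }
        $uniq{$line} = 1;
    }
    return sort keys %uniq;    # sorted only to make the output repeatable
}

my @kept = drop_prefixes(qw(A A|B A|B|A|B|C B|C A|B|A|B|C|N));
print "$_\n" for @kept;
```

Here "A", "A|B" and "A|B|A|B|C" are each removed when a longer line that starts with them is processed, leaving "A|B|A|B|C|N". The line "B|C" also survives: it occurs inside the longer line but is not a leading prefix of it, which is the ordered-set behaviour described above.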

