
Re: Best method to eliminate substrings from array

by Paladin (Vicar)
on Jun 26, 2019 at 18:27 UTC ( [id://11101982] )

in reply to Best method to eliminate substrings from array

How about something like this (this does depend on them being sorted shortest to longest):
use strict;
use warnings;

# Sort shortest to longest so that a shorter group is always seen
# before any longer group that extends it.
my @data = sort { length($a) <=> length($b) } qw(
    2N0472|6N8595|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803
    3419308|3514531|3525716|3557019|3586192|3635776|3783741
    3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854
    3T3625|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721
    3T3628|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934
    4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377
    4N4906|6N6481|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788
    6N7936|6N5049|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|2757803
    6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152|8R0721|9C5344|6W6672|9G7101|3023908|6Y1352|4P0489|1336934
    6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377
    6Y0248|6T7765|9L1366|1189902|1413983|8B2026|1M3381|7K3377|3H5788|1F7854|8W1152
);

my %uniq;
foreach my $elem (@data) {
    my @parts = split /\|/, $elem;
    # Remove every left-anchored prefix of this group that was kept earlier.
    foreach my $p (0 .. $#parts) {
        my $e = join '|', @parts[0..$p];
        delete $uniq{$e} if exists $uniq{$e};
    }
    # The current group supersedes its prefixes.
    $uniq{$elem} = 1;
}

print "$_\n" for keys %uniq;

Re^2: Best method to eliminate substrings from array
by AnomalousMonk (Archbishop) on Jun 28, 2019 at 18:28 UTC

    Some comments:

    • In your code here, an input part number group is treated as a subset of another group only if it is anchored at the left end of the larger group. E.g., the items "7K3377|3H5788", "8W1152" and "4P0489|2757803", if added to the list of test input data, will not be excluded from the output, but, of course, "2N0472|6N8595" and "2N0472" will be.

      In the OPed code, the if-block
          if ($strChain ne $_ && index($_, $strChain) >= 0) { $found = true;  last; }
      implies that a part number group is a subset if it is found anywhere (per the >= comparison) in the larger group (and is not identical to the larger group).

    • Additionally, the OPed code implies that duplicated items in the input appear unchanged in the output (if they are not part of any larger group), e.g., "123 ... 123" in the input would appear as "123 ... 123" in the output. In your code, these items would be made unique.
    • Also, the OPed code would produce output in the same order as the input items (less subsets), although this implied requirement seems less imperative than the others. Because it's taken directly from a hash, your code will produce output in random order.
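
    To make the three points concrete, here is a minimal sketch of the behaviour the OPed if-block seems to imply: match anywhere via index(), keep input order, keep duplicates. The sub name drop_substrings and the shortened test items are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): keep every item that is not a
# proper substring of some other item, preserving input order and duplicates.
sub drop_substrings {
    my @items = @_;
    my @kept;
    ITEM: for my $item (@items) {
        for my $other (@items) {
            # index() >= 0 means $item occurs anywhere in $other,
            # not only at the left end.
            next ITEM if $item ne $other && index($other, $item) >= 0;
        }
        push @kept, $item;
    }
    return @kept;
}

my @input = (
    '2N0472|6N8595|9L1366',
    '6N8595|9L1366',    # occurs inside the first item, but not left-anchored
    '2N0472',           # left-anchored prefix of the first item
    '123',
    '123',              # duplicate that is not part of any larger group
);

print "$_\n" for drop_substrings(@input);
```

    With this input it prints the first item followed by both copies of "123". Note that a plain index() match can also hit in the middle of a part number, so this is only a sketch of the substring semantics, not a recommendation.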

    Give a man a fish:  <%-{-{-{-<

      Here OP says that the list of part numbers is to be treated as sets/subsets, so while the original code matches substrings, OP says later that this is incorrect. My code treats the long strings as ordered sets, which seems to be what the OP wanted. If the OP really wants to treat the list of parts as a non-ordered set, it's easy enough to add a sort to the join line.

      OP also says here that they are sorting the original list anyway, so the input order seems to be irrelevant.

      I'm not quite sure what you mean by the duplicated items part. Essentially, my code breaks each line (set) into individual part numbers (elements), then checks, for each prefix of those elements, whether that prefix already exists in the final result; if it does, the prefix is removed, because the current line supersedes it. So if the current line is "A|B|A|B|C", it first checks whether "A" is in the result and, if so, removes it. It then checks "A|B", then "A|B|A", and so on, until finally adding the entire line "A|B|A|B|C" to the final result. If the line "A|B|A|B|C|N" is found later in the file, "A|B|A|B|C" gets removed at that point.
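
      The walk-through above can be sketched roughly as follows. The sub names prefixes and keep_longest are hypothetical, introduced only for this illustration; the logic is meant to mirror the posted loop, and the result is sorted here only to make the printed order predictable.

```perl
use strict;
use warnings;

# Hypothetical helper: list every left-anchored prefix of a '|'-joined line,
# from the first element alone up to the whole line.
sub prefixes {
    my ($line) = @_;
    my @parts = split /\|/, $line;
    return map { join '|', @parts[0 .. $_] } 0 .. $#parts;
}

# Hypothetical helper: the same idea as the posted loop, wrapped in a sub.
# Assumes the lines arrive sorted shortest to longest.
sub keep_longest {
    my %uniq;
    for my $line (@_) {
        delete @uniq{ prefixes($line) };    # drop anything this line supersedes
        $uniq{$line} = 1;
    }
    return sort keys %uniq;
}

print "$_\n" for prefixes('A|B|A|B|C');
print "--\n";
print "$_\n" for keep_longest('A', 'A|B', 'B|C', 'A|B|A|B|C', 'A|B|A|B|C|N');
```

      The first loop prints the five prefixes "A", "A|B", "A|B|A", "A|B|A|B" and "A|B|A|B|C". The second prints "A|B|A|B|C|N" and "B|C": every shorter line that is a prefix of a later one has been removed, while "B|C" survives because it is not left-anchored in any other line.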