Best method to eliminate substrings from array

catemp has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Best method to eliminate substrings from array by choroba (Cardinal) on Jun 26, 2019 at 20:09 UTC
If the number of all possible parts isn't too high, you can assign numbers to the parts and use vectors where each positions in the vector says whether the corresponding part is present or not. I used 8 bits to represent each part, but 2 bits should be enough if you need to reduce space (1 bit isn't enough, we need 4 different values, as will be explained shortly). For example, A001\|B002 has already been seen, the assigned numbers are 1 for A001 and 2 for A002. We then read B002\|A001\|C003. C003 will be assigned the number 3, so the stored vector is 011 and the new one is 111. If we use 2 instead of 1 in the stored vectors, we can just bitwise or the two numbers and see what should be done: "22" \| "111" = 331, where 1 means "present in the new only" and 3 means "present in both" (2 would be "present in stored only"). If there are 3's only, we've already seen exactly the same combination of parts. If there's no 1, the new combination was contained in a stored one, if there's no 2, the new combination contains the old one. #!/usr/bin/perl use warnings; use strict; use feature qw{ say }; my $last_part = 1; my (%part, %store); while (my $line = <DATA>) { chomp $line; my @parts = split /\\|/, $line; ! exists $part{$_} and $part{$_} = $last_part++ for @parts; my $string = 0 x ($last_part - 1); substr $string, $part{$_}, 1, 1 for @parts; my %back = reverse %part; my $store = 1; for my $seen (keys %store) { my $result = "$seen" \| "$string"; if ($result =~ /^3+$/ # Same as old. \|\| $result !~ /1/ # Contained in old. ) { $store = 0; } elsif ($result !~ /2/) { # Contains old. delete $store{$seen}; } } undef $store{$string =~ tr/1/2/r} if $store; } say 'Kept: '; my %back = reverse %part; for my $stored (keys %store) { say join '\|', map substr($stored, $_, 1) ? $back{$_} : (), 1 .. length $stored; } __DATA__ A001\|B002 C003\|A001\|B002 B002\|A001 C003\|D004\|A001 E005\|F006 D004\|C003 [download] Update: Switched to bitwise string or from +, so Math::BigInt is not needed. `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l] [select]
Re: Best method to eliminate substrings from array by kcott (Archbishop) on Jun 27, 2019 at 08:32 UTC
G'day catemp, Welcome to the Monastery. After looking through the thread, I couldn't see any specifics about how you wanted the output ordered. Here's three solutions with different output orders; they become progressively more complicated and, accordingly, will take longer to run. I started with the data you posted below. I then added a few more lines for test purposes: `6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8` [download] appears three times: two adjacent; one separated. `6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788` [download] is a new substring of others starting with `6Y0248` but is separated from them; it's the new first record. Here's the code containing all three solutions: #!/usr/bin/env perl use strict; use warnings; my @in_data = qw{ 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788 2N0472\|6N8595\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 3419308\|3514531\|3525716\|3557019\|3586192\|3635776\|3783741 3T3625\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4 3T3625\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721 3T3628\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 4N4906\|6N6481\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377 4N4906\|6N6481\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8 R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152 }; print "Input data:\n", '-' x 40, "\n"; print "$_\n" for @in_data; { my @out_data; print '-' x 40, "\nOutput data (unordered):\n", '-' x 40, "\n"; my $long_string; for my $string (reverse sort @in_data) { if (defined $long_string) { next if index($long_string, $string) == 0; } $long_string = $string; print "$string\n"; } } { my @out_data; print '-' x 40, "\nOutput data (ordered by first field):\n", '-' x + 40, "\n"; my $long_string; for my $string (reverse sort @in_data) { if (defined $long_string) { next if index($long_string, $string) == 0; } $long_string = $string; push @out_data, $string; } print "$_\n" for sort @out_data; } { my @out_data; print '-' x 40, "\nOutput data (original order):\n", '-' x 40, "\n +"; my $long_string; my $pos = 0; for my $pos_string (sort { $b->[1] cmp $a->[1] } map [ $pos++, $_ +], @in_data) { if (defined $long_string) { next if index($long_string, $pos_string->[1]) == 0; } $long_string = $pos_string->[1]; push @out_data, $pos_string; } print "$_->[1]\n" for sort { $a->[0] <=> $b->[0] } @out_data; } print '-' x 40, "\n"; [download] Note the use of anonymous blocks to limit the scope of identically named variables (e.g. `@out_data` and `$long_string`). Here's the output: Input data: ---------------------------------------- 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788 2N0472\|6N8595\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 3419308\|3514531\|3525716\|3557019\|3586192\|3635776\|3783741 3T3625\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4 3T3625\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721 3T3628\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 4N4906\|6N6481\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377 4N4906\|6N6481\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8 R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152 ---------------------------------------- Output data (unordered): ---------------------------------------- R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8 4N4906\|6N6481\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788 3T3628\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 3T3625\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721 3419308\|3514531\|3525716\|3557019\|3586192\|3635776\|3783741 2N0472\|6N8595\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 ---------------------------------------- Output data (ordered by first field): ---------------------------------------- 2N0472\|6N8595\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 3419308\|3514531\|3525716\|3557019\|3586192\|3635776\|3783741 3T3625\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721 3T3628\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 4N4906\|6N6481\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 ---------------------------------------- Output data (original order): ---------------------------------------- 2N0472\|6N8595\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 3419308\|3514531\|3525716\|3557019\|3586192\|3635776\|3783741 3T3625\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721 3T3628\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 4N4906\|6N6481\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8 R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|2757803 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1F785 +4\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|1336934 ---------------------------------------- [download] — Ken	[reply] [d/l] [select]
Re: Best method to eliminate substrings from array by Paladin (Vicar) on Jun 26, 2019 at 18:27 UTC
How about something like this (this does depend on them being sorted shortest to longest): my @data = sort { length($a) <=> length($b) } qw( 2N0472\|6N8595\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1 +F7854\|8W1152\|8R0721\|9C5344\|6W6672\|�9G7101\|3023908\|6Y1352\|4P0489\|27578 +03 3419308\|3514531\|3525716\|3557019\|3586192\|3635776\|3783741 3T3625\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1 +F7854 3T3625\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1 +F7854\|8W1152\|8R0721 3T3628\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1 +F7854\|8W1152\|8R0721\|9C5344\|6W6672\|�9G7101\|3023908\|6Y1352\|4P0489\|13369 +34 4N4906\|6N6481\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377 4N4906\|6N6481\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788 6N7936\|6N5049\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1 +F7854\|8W1152\|8R0721\|9C5344\|6W6672\|9G7101\|3023908\|6Y1352\|4P0489\|275780 +3 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1 +F7854\|8W1152\|8R0721\|9C5344\|6W6672\|�9G7101\|3023908\|6Y1352\|4P0489\|13369 +34 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377 6Y0248\|6T7765\|9L1366\|1189902\|1413983\|8B2026\|1M3381\|7K3377\|3H5788\|1 +F7854\|8W1152 ); my %uniq; foreach my $elem (@data) { my @parts = split /\\|/, $elem; foreach my $p (0 .. $#parts) { my $e = join '\|', @parts[0..$p]; delete $uniq{$e} if exists $uniq{$e}; } $uniq{$elem} = 1; } print "$_\n" for keys %uniq; [download]	[reply] [d/l]
Re^2: Best method to eliminate substrings from array by AnomalousMonk (Archbishop) on Jun 28, 2019 at 18:28 UTC
Some comments: In your code here, an input part number group item is treated as a subset of another group only if it is anchored at the left end of the larger group. E.g., the items `7K3377\|3H5788 8W1152 4P0489\|2757803` added to the list of test input data will not be excluded from output, but, of course, `2N0472\|6N8595 2N0472` will be. In the OPed code, the `if`-block `if ($strChain ne $_ && index($_, $strChain) >= 0) { $found = true; last; }` implies that a part number group is a subset if it is found anywhere (per the `>=` comparison) in the larger group (and is not identical to the larger group). Additionally, the OPed code implies that duplicated items in the input appear unchanged in the output (if they are not part of any larger group), e.g., `123 ... 123` in the input would appear as `123 ... 123` in the output. In your code, these items would be made unique. Also, the OPed code would produce output in the same order as the input items (less subsets), although this implied requirement seems less imperative than the others. Because it's taken directly from a hash, your code will produce output in random order. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^3: Best method to eliminate substrings from array by Paladin (Vicar) on Jun 28, 2019 at 21:39 UTC
Here OP says that the list of part numbers are to be treated as sets/subsets, so while the original code matches sub-strings, OP says later that is incorrect. My code treats the long strings as ordered sets, which seems to be what the OP wanted. If the OP really wants to treat the list of parts as a non-ordered set, it's easy enough to add a `sort` to the `join` line. OP also says here they are sorting the original list anyways, so the input order seems to be irrelevant. I'm not quite sure what you mean by the duplicated items part. Essentially what my code does is break each line (set), into individual part numbers (elements), then checks if for each prefix of elements, does that one already exist in the final result, and if it does, remove it from the final result, as this current line will supersede it. So if the current line was "A\|B\|A\|B\|C", it first checks if "A" is in the result; If so, remove it. Then checks "A\|B", then "A\|B\|A", etc. until finally adding the entire line "A\|B\|A\|B\|C" to the final result. If later in the file, the line "A\|B\|A\|B\|C\|N" is found, at that point, the "A\|B\|A\|B\|C" would get removed.	[reply] [d/l] [select]
Re: Best method to eliminate substrings from array by jwkrahn (Abbot) on Jun 26, 2019 at 18:53 UTC
`push(@arrCompletedChains, $strChain);` [download] As it says in perlsyn: `If any part of LIST is an array, "foreach" will get very confused if you add or remove elements within the loop body, for example with "splice". So don't do that.` [download] UNTESTED, but you could see if this is faster: `use List::Util 'first'; # sort largest first my @sorted = sort { length $b <=> length $a } @arrWorkingCompletedChai +ns; my %to_delete; # remove shortest first while ( my $next = pop @sorted ) { if ( first { index( $_, $next ) >= 0 } @sorted ) { ++$to_delete{ $next }; } \| @arrWorkingCompletedChains = grep !$to_delete{ $_ }, @arrWorkingComple +tedChains;` [download]	[reply] [d/l] [select]
Re: Best method to eliminate substrings from array (updated) by AnomalousMonk (Archbishop) on Jun 27, 2019 at 02:14 UTC
Here's an approach based on index. This is presented in a unit-test format (see How to ask better questions using Test::More and sample data); if you have a test case that's problematic, please let me know it using the test-case syntax detailed in the code below (see the `@Tests` array). I haven't tried this code with an input list of a million 250-character part number groups, but I think it should fit into a laptop-ish memory footprint! I've made no attempt to Benchmark this; you're welcome to do so, and if you do, please let us know the results! File `toss_included_substrings_1.pl`: Read more... (7 kB) Update: Note on Output Order. The output order of the `no_subgroups()` function is similar to that for List::MoreUtils`::uniq()` (also in recent versions of List::Util): identical to the input order except that "subgroup" items are removed from the stream. The one question I have concerns identical repeats of part numbers or part number groups. ~~The~~ \| My current code allows repeated identical items to pass through unmolested (update: the OPed code suggests this specification). In the notation of the test code: `[ [ qw(foo foo) ] => [ qw(foo foo) ] ]` But whether an item can be a subgroup of itself is unclear to me. Two other possibilities for handling such repeats would be: `[ [ qw(foo foo) ] => [ qw(foo) ] ]` Repeats made unique a la `uniq()`; `[ [ qw(foo foo) ] => [ ] ]` Repeats removed because they are subgroups of themselves. Either of these latter two behaviors can be achieved simply by changing the position and number of calls to `uniq()` within the `no_subgroups()` function. If you're interested in this, I can post further info. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re: Best method to eliminate substrings from array by AnomalousMonk (Archbishop) on Jun 26, 2019 at 17:13 UTC
Each string is a set of two or more values separated by pipes ("\|"). ... eliminate any strings that are substrings of other strings ... What's a "value"? Can you have a string `'AA\|BB\|CC`? If so, is `'A\|B'` a valid substring of it? (This would be true if testing with index.) Basically, I don't understand the concept of "substring." Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^2: Best method to eliminate substrings from array by catemp (Initiate) on Jun 26, 2019 at 17:23 UTC
The values are actually part numbers so they might be something like 124182 or 3718317. The strings are lists of parts that are related to each other and when I generate the lists I might get a string of five parts and then another string containing those same five parts plus one more. I would want to only keep the string with the six parts because the five part string is essentially a substring of the six part string. Instead of substring you could say subset.	[reply]
Re: Best method to eliminate substrings from array by AnomalousMonk (Archbishop) on Jun 26, 2019 at 18:35 UTC
Please see Short, Self-Contained, Correct Example, and also and in particular How to ask better questions using Test::More and sample data. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l]
Re: Best method to eliminate substrings from array by choroba (Cardinal) on Jun 26, 2019 at 18:05 UTC
How many parts are there in the file? `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^ARGV,3]`	[reply] [d/l]
Re: Best method to eliminate substrings from array by Anonymous Monk on Jun 26, 2019 at 17:17 UTC
Do all the values overlap with one big set or can there be disjoint sets? Are they always ordered? Can there be duplicates?	[reply]
Re^2: Best method to eliminate substrings from array by catemp (Initiate) on Jun 26, 2019 at 17:26 UTC
I am currently eliminating duplicates `@arrWorkingCompletedChains = uniq @arrWorkingCompletedChains;` and sorting the list largest to smallest `@arrWorkingCompletedChains = sort { length($b) <=> length($a) } @arrWorkingCompletedChains;` What I need to do is eliminate any overlapping string sets from the working array so the end result can be stored in our database	[reply] [d/l] [select]
Re^3: Best method to eliminate substrings from array by Anonymous Monk on Jun 26, 2019 at 17:39 UTC
you didn't answer one of the questions... more sample data plz	[reply]
Re^4: Best method to eliminate substrings from array by catemp (Initiate) on Jun 26, 2019 at 18:00 UTC


Welcome to the Monastery
	PerlMonks