Re: retain longest multi words units from hash

Here is my attempt at a solution (also trying my hand at POD (which I badly need to learn!)). Comments and corrections appreciated, as always.

Solution overview: This particular solution attempt takes all loaded entries, processing them in key length (smallest first) then string order. Each key is split on non-word characters. If no non-word characters are present, the entry is copied from the originating hash to the working hash. If one or more non-word characters are present, then the possible partial keys are generated. For each partial key, if an entry for the partial key exists in the working hash, it is deleted. Once the partial keys have been processed, the current entry is copied to the working hash.

Test results:

Original dataset:

%data = (
          'original' => {
                          'automation' => 2,
                          'automation technology' => 2,
                          'automation technology process' => 3,
                          'mass creation' => 2,
                          'rendition' => 3,
                          'saturation' => 3
                        },
          'resulting' => {
                         'automation technology process' => 3,
                         'mass creation' => 2,
                         'rendition' => 3,
                         'saturation' => 3
                       }
        );

Enhanced data set 1:

%data = (
          'original' => {
                          'automation' => 2,
                          'automation technology' => 2,
                          'automation technology process' => 3,
                          'bar' => 1,
                          'bar baz' => 1,
                          'bar baz quux' => 1,
                          'baz' => 1,
                          'baz quux' => 1,
                          'foo' => 1,
                          'foo bar' => 1,
                          'mass creation' => 2,
                          'quux' => 1,
                          'quuz' => 1,
                          'rendition' => 3,
                          'saturation' => 3
                        },
          'resulting' => {
                         'automation technology process' => 3,
                         'bar baz quux' => 1,
                         'foo bar' => 1,
                         'mass creation' => 2,
                         'quuz' => 1,
                         'rendition' => 3,
                         'saturation' => 3
                       }
        );

Enhanced data set 2:

%data = (
          'original' => {
                          'automation' => 2,
                          'automation technology' => 2,
                          'automation technology process' => 3,
                          'bar' => 2,
                          'bar baz' => 1,
                          'bar baz quux' => 1,
                          'bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                          'baz' => 1,
                          'baz quux' => 1,
                          'baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                          'corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                          'foo' => 2,
                          'foo bar' => 1,
                          'foo-bar' => 1,
                          'foo-bar-baz' => 1,
                          'foo-bar-baz-qux' => 1,
                          'foo-bar-baz-qux-quux' => 1,
                          'foo-bar-baz-qux-quux-quuz' => 1,
                          'foo-bar-baz-qux-quux-quuz-corge' => 1,
                          'foo-bar-baz-qux-quux-quuz-corge-grault' => 1,
                          'foo-bar-baz-qux-quux-quuz-corge-grault-garply' => 1,
                          'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo' => 1,
                          'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred' => 1,
                          'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh' => 1,
                          'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy' => 1,
                          'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                          'fred-plugh-xyzzy-thud' => 1,
                          'grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                          'mass creation' => 2,
                          'plugh-xyzzy-thud' => 1,
                          'quux' => 1,
                          'quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                          'quuz' => 1,
                          'quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                          'qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                          'rendition' => 3,
                          'saturation' => 3,
                          'thud' => 1,
                          'waldo-fred-plugh-xyzzy-thud' => 1,
                          'xyzzy-thud' => 1
                        },
          'resulting' => {
                         'automation technology process' => 3,
                         'bar baz quux' => 1,
                         'foo bar' => 1,
                         'foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud' => 1,
                         'mass creation' => 2,
                         'rendition' => 3,
                         'saturation' => 3
                       }
        );

Code:

#!/usr/bin/env perl

use strict;
use warnings;

use Data::Dumper;
use Getopt::Long;
use List::MoreUtils qw{ uniq };

# My preferences regarding Data::Dumper output.
$Data::Dumper::Deepcopy = 1;
$Data::Dumper::Sortkeys = 1;

$| = 1;

my $_DEBUG = 0;    # Toggle debug print statements

my %data;

# Load data from DATA handle
while ( my $line = <DATA> ) {
    chomp $line;

    $line =~ s/\# .* $//msx;    # Remove comments from data
    $line =~ s/ \s* $//msx;     # Remove trailing whitespace

    next if ( $line =~ m/^ \s* $/imsx );    # Exclude blank lines

    $data{original}{$line}++;
}

%{ $data{resulting} } =
  retain_longest_mw_units( %{ $data{original} } );

print Data::Dumper->Dump( [ \%data, ], [qw( *data )] ), qq{\n};

# ======================================
# Subroutines
# ======================================
sub retain_longest_mw_units {
    my (%dataset) = @_;

    my %workingset;

    my @seen_key =
      uniq sort { length $a <=> length $b || $a cmp $b }
      keys %dataset;
    print Data::Dumper->Dump( [ \@seen_key, ],
        [qw( *seen_key )] ), qq{\n}
      if ($_DEBUG);

    my @key_length;
    foreach my $str (@seen_key) {
        my $str_val = $dataset{$str};

        my @non_word = $str =~ m/(\W)/gmsx;
        print Data::Dumper->Dump(
            [ \$str, \@non_word, ],
            [qw( *str *non_word )]
          ),
          qq{\n}
          if ($_DEBUG);

        if ( !scalar @non_word ) {
            $workingset{$str} = $dataset{$str};
        }
        else {
            my @separator;
            my $lookup_start = 0;
            push @separator,
              { boln => 1, eoln => 0, idx => 0, position => 0, };
            foreach my $i ( 0 .. $#non_word ) {
                my $separator_idx = index $str, $non_word[$i],
                  $lookup_start;
                push @separator,
                  {
                    eoln      => 0,
                    idx       => scalar @separator,
                    character => $non_word[$i],
                    position  => $separator_idx,
                  };
                $lookup_start = $separator_idx + 1;
            }
            push @separator,
              {
                eoln => 1,
                idx  => scalar @separator,
                len  => length $str,
              };
            print Data::Dumper->Dump(
                [ \$str, \@separator, ],
                [qw( *str *separator )]
              ),
              qq{\n}
              if ($_DEBUG);

            my %intended = ();
            foreach my $i ( 0 .. $#separator ) {
                foreach my $j ( $i .. $#separator ) {
                    next if ( $i == $j );
                    next
                      if (  ( $i == 0 )
                        and ( $j == $#separator ) );

                    my $intended_key = q{};
                    my $start        = -1;
                    my $str_len      = -1;

                    if ( $separator[$i]{boln} ) {
                        $start = $separator[$i]{position};
                        $str_len =
                            $separator[$j]{eoln}
                          ? $separator[$j]{len}
                          : $separator[$j]{position};
                        $intended_key = substr $str, $start,
                          $str_len;
                    }
                    else {
                        $start   = $separator[$i]{position} + 1;
                        $str_len = (
                              $separator[$j]{eoln}
                            ? $separator[$j]{len}
                            : $separator[$j]{position}
                          ) -
                          $start;
                        $intended_key = substr $str, $start,
                          $str_len;
                    }

                    print sprintf join( qq{\t},
                        q{i: %d},
                        q{j: %d},
                        q{start: %d},
                        q{str_len: %d},
                        q{intended_key: %s},
                      )
                      . qq{\n},
                      $i, $j, $start, $str_len, $intended_key
                      if ($_DEBUG);
                    $intended{$intended_key}++;
                }
            }

            foreach my $k ( keys %intended ) {
                if ( exists $workingset{$k} ) {
                    delete $workingset{$k};
                }
            }
            $workingset{$str} = $dataset{$str};
            print Data::Dumper->Dump(
                [ \$str, \%intended, ],
                [qw( *str *intended )]
              ),
              qq{\n}
              if ($_DEBUG);

        }
    }

    print Data::Dumper->Dump(
        [ \@seen_key, \@key_length ],
        [qw( *seen_key *key_length )]
      ),
      qq{\n}
      if ($_DEBUG);

    return %workingset;
}

=pod

=head1 NAME

1219394-atcroft_attempt.pl - atcroft's attempt at the problem

=head1 SYNOPIS

./1219394-atcroft_attempt.pl

=head1 DESCRIPTION

=head2 HISTORY

A L<post|https://www.perlmonks.org/?node_id=1219394> was 
submitted 2018-07-28T00:42Z by someone not logged into the 
site (referred to as an I<Anonymous Monk>, and hereafter 
abbreviated I<AM>). The problem posed
was given a hash containing multi-word units and their 
frequency, they wanted to remove any units contained within a
longer unit.

The example provided by the AM was the following:

    # Input
    $VAR1 = {
          'rendition' => '3',
          'automation' => '2',
          'saturation' => '3',
          'mass creation' => 2,
          'automation technology' => 2
          'automation technology process' => 3
        };
    # Desired output
    $VAR1 = {
          'rendition' => '3',
          'saturation' => '3',
          'mass creation' => 2,
          'automation technology process' => 3
        };


This code attempts to answer that question.

=head1 DETAILS

L<LanX|https://www.perlmonks.org/?node_id=708738> was the first
L<response|https://www.perlmonks.org/?node_id=708738>, with the
suggestion of a partial solution by doing the following:

=over 4

=item C<loop over the keys>

=item C<split multi-words into a list>

=item C<if the list contains more than one word,
    delete single words in that list>

=back

He updated a little while later noting that his suggestion
missed multi-word duplicates. It was at after reading that post
(in oldest-first order) that this author set about trying to
accomplish the task himself.

This particular solution attempt takes all loaded entries,
processing them in key length (smallest first) then string
order. Each key is split on non-word characters. If no
non-word characters are present, the entry is copied from the
originating hash to the working hash. If one or more 
non-word characters are present, then the possible partial
keys are generated. For each partial key, if an entry for the
partial key exists in the working hash, it is deleted. Once the
partial keys have been processed, the current entry is copied
to the working hash.

=head1 DEPENDENCIES

This code uses the L<List::MoreUtils> uniq function, to avoid
having to copy one or roll my own.

=head1 BUGS

Assume they are likely present, and proceed appropriately. If
confirmed, please inform this author.

=over 4

=item It is likely more verbose than necessary.

=item It likely contains evidence of bad programming habits.

=item It likely contains evidence of bad programming style(s).

=item It breaks keys on non-word characters (which may be an
issue depending on the reader's expection regarding word 
characters).

=item It may contain other limitations and/or restrictions.

=back

=head1 AUTHOR

L<atcroft|https://www.perlmonks.org/?node_id=70929>

=head1 COPYRIGHT AND LICENSE

Copyright 2018 by atcroft

This code is free software; you may redistribute it and/or modify
it under the same terms as Perl itself.

=cut

__DATA__
# Original data (matches original counts)
rendition
automation
mass creation
automation technology
rendition
automation
saturation
mass creation
automation technology process
saturation
automation technology
automation technology process
saturation
rendition
automation technology process

# Included as part of Enhanced Data Sets 1 & 2
foo
bar
baz
quux
quuz
foo bar
bar baz
baz quux
bar baz quux

# Included as part of Enhanced Data Set 2
foo
bar
foo-bar
foo-bar-baz
foo-bar-baz-qux
foo-bar-baz-qux-quux
foo-bar-baz-qux-quux-quuz
foo-bar-baz-qux-quux-quuz-corge
foo-bar-baz-qux-quux-quuz-corge-grault
foo-bar-baz-qux-quux-quuz-corge-grault-garply
foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo
foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred
foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh
foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy
bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-thud
corge-grault-garply-waldo-fred-plugh-xyzzy-thud
grault-garply-waldo-fred-plugh-xyzzy-thud
waldo-fred-plugh-xyzzy-thud
fred-plugh-xyzzy-thud
plugh-xyzzy-thud
xyzzy-thud
thud

foo-bar-baz-qux-quux-quuz-corge-grault-garply-waldo-fred-plugh-xyzzy-t
+hud
[download]

Hope it helps.

Comment on Re: retain longest multi words units from hash Download Code


Your skill will accomplish what the force of many cannot
	PerlMonks